Random crashes



  • I have installed 2.3.4 and then upgraded to 2.3.4-p1 on a Shuttle DS68U:

    Celeron 3855U
    8GB DDR3
    Intel i211 (igb) and i219m (em) NICs
    128GB NVMe drive

    Everything is working, but I started experiencing random crashes and reboots on average once a day.

    There is nothing in /var/crash and other logs don't show any error. Other than that I am running the same packages I was running on my previous hardware (an old Dell laptop) which never crashed:

    Acme
    Avahi
    ntopng
    openvpn-client-export
    pfBlockerNG
    Service Watchdog
    Status traffic Totals
    System patches

    The 128GB NVMe M.2 drive initially caused the automatic installer to fail, but I managed to work around that and complete the installation. Could NVMe compatibility be causing the instability? I am tempted to remove the NVMe drive and try a fresh install on a regular HD using the same configuration.



  • Removed the NVMe drive, installed a standard 2.5" HD, reinstalled 2.3.4, updated it to 2.3.4-p1 and installed the same packages. After restoring the config file everything worked as expected, but this morning I checked the uptime and it looks like it rebooted itself again around 4:40am.

    It always happens within 36hrs and again I can't see anything in the logs :(

    I'm going to remove packages one at a time and see what happens. Removed ntopng and manually rebooted. The next step will be to disable the traffic shaper, and after that an overnight memtest on the box.



  • After more troubleshooting and keeping a serial console always connected I was able to determine that the crashes are caused by the HD getting detached for some unknown reason:

    ada0 at ahcich0 bus 0 scbus0 target 0 lun 0
    ada0: <ST9160310AS DE06> s/n 5GV6HR23D detached
    (ada0:ahcich0:0:0:0): Periph destroyed
    /: got error 6 while accessing filesystem

    I could try a different HD, but I think the current one is good as it was working just fine on my previous pfSense box (laptop). Also, I don't know for sure if the same thing was happening with the nvme drive since I did not have a console connected at the time.

    What I find interesting is that it always happens between 24 and 28 hours of uptime. Once 24 hours have passed I know that a crash is imminent.
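
    For anyone else logging the serial console to a file, the detach events can be pulled out of the capture with a small filter. This is only a sketch: `detach_events` is a made-up helper name, `console.log` is a hypothetical capture file, and the pattern assumes the kernel message layout quoted in this post.

```shell
# Print only the "<model> s/n <serial> detached" lines from console
# output supplied on stdin. The pattern is based on the messages quoted
# in this thread; other drivers may word the message differently.
detach_events() {
    grep -E '^[[:space:]]*[a-z]+[0-9]+: <[^>]*> s/n [^ ]+ detached[[:space:]]*$'
}

# Usage (console.log is a hypothetical serial capture file):
#   detach_events < console.log
#   detach_events < console.log | wc -l    # how many detaches were logged
```

    Counting the matches per capture makes it easier to tell whether every reboot is preceded by a detach or only some of them.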



  • Just happened again, after about 26 hours, with the same storage detached issue.
    Is there anything at the OS level that would cause the storage to get detached maybe after inactivity?
    This is also a Skylake-based CPU/chipset. Are there any known compatibility issues?

    I'll try a different HD, but I'm not confident it will change anything. After that I might consider giving 2.4 a try… I'm really puzzled, because when it works it works very well... So frustrating



  • Unfortunately I have to report that things have not improved for me. I tried two standard HDs, switched to 2.4 and tried the NVMe drive in EFI mode, which is now at least supported by the installer. I replaced the RAM with a specific brand and model from the DS68U compatibility list, even though the existing RAM passed memtest86 with flying colors.

    Nothing! Every day, between 24hrs and 31hrs (new record) of uptime, the box just spontaneously reboots. It can be in the middle of the night or any time during the day, but never before reaching at least 24 hrs of uptime. When it runs, it works great: performance is good, load is low, temps are stable around 34C, and SMART reports all good. I am at a loss…

    I will have to schedule a daily reboot at night with cron. At this point I don't know what else to do.



  • Did you ever figure this out? I converted my old Skylake-based server into a router and I'm having the same issue. I swapped out the SSD, thinking that was the cause. Then I changed the BIOS SATA controller to IDE mode and it seemed stable for a long time (a full month without the drive detaching). It seems totally random; sometimes I can't go a few hours without having an issue. Tomorrow I'm going to try a fresh install on a USB drive instead.


  • Netgate Administrator

    You see a crash report?

    Anything on the console?

    If it reboots at random with nothing logged it's almost certainly hardware.

    Steve



  • @stephenw10

    It's the "Periph destroyed" error with the SSD detaching. I've been running a USB drive install since this morning without any issues so far.


  • Netgate Administrator

    Be sure you don't have swap (or at least are not swapping) and have moved /var and /tmp to RAM if running from flash.

    Also check that root is mounted noatime.

    [2.4.4-RELEASE][admin@fw1.stevew.lan]/root: mount -p
    /dev/diskid/DISK-9E18E959s2a /			ufs	rw,noatime 	1 1
    devfs			/dev			devfs	rw		0 0
    /dev/diskid/DISK-9E18E959s1 /boot/u-boot		msdosfs	rw,noatime 	0 0
    /dev/md0		/tmp			ufs	rw		2 2
    /dev/md1		/var			ufs	rw		2 2
    devfs			/var/dhcpd/dev		devfs	rw		0 0
    

    Steve
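
    The noatime check can also be done without reading the table by eye. This is only a sketch: `root_has_noatime` is a made-up helper name, and it assumes the six-column fstab-style layout that `mount -p` prints above.

```shell
# Read fstab-style lines (as printed by `mount -p`) on stdin and succeed
# only if the entry mounted at / lists the noatime option.
root_has_noatime() {
    awk '
        $2 == "/" {
            found = 1
            n = split($4, opts, ",")          # options are comma-separated
            for (i = 1; i <= n; i++)
                if (opts[i] == "noatime") ok = 1
        }
        END { if (found && ok) exit 0; exit 1 }
    '
}

# Usage on the firewall itself:
#   mount -p | root_has_noatime && echo "root is noatime" || echo "root is NOT noatime"
```

    Reading from stdin keeps the helper usable against a saved copy of the output as well as the live system.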



  • @stephenw10 said in Random crashes:

    noatime

    /dev/gptid/ba785815-9ce5-11e9-8bc0-90e2ba09f08c /			ufs	rw		1 1
    devfs			/dev			devfs	rw		0 0
    /dev/md0		/tmp			ufs	rw		2 2
    /dev/md1		/var			ufs	rw		2 2
    devfs			/var/dhcpd/dev		devfs	rw		0 0
    

    It seems swap is enabled. I was following this tutorial on disabling swap: https://forum.netgate.com/topic/107375/howto-remove-swap-post-install-and-resize/2

    /dev/gptid/ba785815-9ce5-11e9-8bc0-90e2ba09f08c	/	ufs	rw	1	1
    #/dev/gptid/ba7d42de-9ce5-11e9-8bc0-90e2ba09f08c	none	swap	sw	0	0
    
    

    Should I change the "ba785815-9ce5-11e9-8bc0-90e2ba09f08c" to something else before rebooting? I'm pretty sure it's good but I wanted to double check before rebooting.


  • Netgate Administrator

    Hmm, I've never tried that. I would back up the config, reinstall, and remove the swap during the install.

    You should edit the fstab so that root is mounted noatime, though, if it isn't already.

    Steve
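
    A minimal sketch of that fstab change, assuming an /etc/fstab laid out like the one quoted earlier in the thread. `add_root_noatime` is a made-up helper name and `/tmp/fstab.new` is just an example path. The helper writes to stdout rather than editing in place, so the result can be reviewed before anything is copied over the real file.

```shell
# Read an fstab on stdin and print it with noatime appended to the
# options of the entry mounted at /. Comment lines and entries that
# already have noatime are printed unchanged.
add_root_noatime() {
    awk '
        BEGIN { OFS = "\t" }                  # a rewritten line is tab-separated
        /^[ \t]*#/ { print; next }            # leave comments untouched
        $2 == "/" && $4 !~ /(^|,)noatime(,|$)/ { $4 = $4 ",noatime" }
        { print }
    '
}

# Usage (review the diff before replacing the real file):
#   add_root_noatime < /etc/fstab > /tmp/fstab.new
#   diff /etc/fstab /tmp/fstab.new
```

    Keep a copy of the original fstab somewhere reachable from the console, since a broken root entry can stop the box from booting.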

