How to automate fsck? (SG-2440)



  • So, I rebooted an SG-2440 at a remote site, and it didn't come back up.

    I went over there, plugged in the console cable, pressed <Enter>, and got a # prompt.

    Stupidly, instead of poking around, I typed "exit", and it immediately booted, complaining about some fsck fixes it had to do.
    Thus I don't know what kind of shell I was in, or why. What was printed to the console before I connected is lost.

    Then it gave me a ton of lines like these:
    ----------------------------------------------------------------
    281.465158 [ 274] generic_find_num_queues  called, in txq 0 rxq 0
    281.476184 [ 799] generic_netmap_dtor      Restored native NA 0
    281.483175 [ 266] generic_find_num_desc    called, in tx 1024 rx 1024
    281.490567 [ 274] generic_find_num_queues  called, in txq 0 rxq 0
    281.497547 [ 799] generic_netmap_dtor      Restored native NA 0
    281.504807 [ 266] generic_find_num_desc    called, in tx 1024 rx 1024
    281.512232 [ 274] generic_find_num_queues  called, in txq 0 rxq 0
    done.
    281.519241 [ 799] generic_netmap_dtor      Restored native NA 0
    281.526864 [ 266] generic_find_num_desc    called, in tx 1024 rx 1024
    281.534269 [ 274] generic_find_num_queues  called, in txq 0 rxq 0
    281.541352 [ 799] generic_netmap_dtor      Restored native NA 0
    281.548217 [ 266] generic_find_num_desc    called, in tx 1024 rx 1024
    281.555776 [ 274] generic_find_num_queues  called, in txq 0 rxq 0
    281.562758 [ 799] generic_netmap_dtor      Restored native NA 0
    281.569795 [ 266] generic_find_num_desc    called, in tx 1024 rx 1024
    281.577263 [ 274] generic_find_num_queues  called, in txq 0 rxq 0
    281.584263 [ 799] generic_netmap_dtor      Restored native NA 0
    281.595263 [ 266] generic_find_num_desc    called, in tx 1024 rx 1024
    281.603180 [ 274] generic_find_num_queues  called, in txq 0 rxq 0
    281.610788 [ 799] generic_netmap_dtor      Restored native NA 0
    Starting NTP tim281.618288 [ 266] generic_find_num_desc    called, in tx 1024 rx 1024
    e client…281.627131 [ 274] generic_find_num_queues  called, in txq 0 rxq 0
    281.635177 [ 799] generic_netmap_dtor      Restored native NA 0
    281.642505 [ 266] generic_find_num_desc    called, in tx 1024 rx 1024
    281.650094 [ 274] generic_find_num_queues  called, in txq 0 rxq 0
    281.657131 [ 799] generic_netmap_dtor      Restored native NA 0
    281.664235 [ 266] generic_find_num_desc    called, in tx 1024 rx 1024
    281.671654 [ 274] generic_find_num_queues  called, in txq 0 rxq 0
    281.678689 [ 799] generic_netmap_dtor      Restored native NA 0
    281.685705 [ 266] generic_find_num_desc    called, in tx 1024 rx 1024
    281.693152 [ 274] generic_find_num_queues  called, in txq 0 rxq 0
    281.702990 [ 799] generic_netmap_dtor      Restored native NA 0
    done.
    Starting DHCP service…done.
    Configuring firewall.....0 addresses deleted.
    0 addresses deleted.
    .done.
    Generating RRD graphs...done.
    Starting syslog...done.
    [boot process continues…..]
    ----------------------------------------------------------------

    Well, the box is up now, but what the hey?
    Is it normal for these to get stuck at fsck and require manual intervention?
    What do all the generic_ lines mean?


  • Banned

    Wait for 2.4 (or use the snapshots) and switch to ZFS. UFS and its fsck are totally broken and unfixable. Trying to fix things with fsck will eventually destroy the filesystem.



  • Guessing that'll mean a reinstall, then…
    Any idea about the generic_ lines?


  • Banned

    Yes, reinstall is the only way to fix UFS.

    I've filed a multitude of bugs about UFS and fsck. fsck is so broken that it needs multiple successive manual runs to even try to repair the filesystem, and then it gets all sorts of things wrong, segfaults, or spits out various confused nonsense, and eventually screws up the filesystem to the point where you cannot boot any more.

    I got the patch below from one of the pfSense devs for debugging. While it tries to run fsck much more aggressively, as noted above, the only result in the end was complete FS destruction. Also, it would need updating for 2.3.2 or newer, apparently.

    
    diff --git a/src/etc/rc b/src/etc/rc
    index e82a5ba..970fa9c 100755
    --- a/src/etc/rc
    +++ b/src/etc/rc
    @@ -54,7 +54,7 @@ fi
    
     if [ -e /root/force_fsck ]; then
     	echo "Forcing filesystem(s) check..."
    -	/sbin/fsck -y -F -t ufs
    +	/sbin/fsck -y
     fi
    
     if [ "${PLATFORM}" != "cdrom" ]; then
    @@ -77,18 +77,37 @@ if [ "${PLATFORM}" != "cdrom" ]; then
    
     	if [ ${FSCK_ACTION_NEEDED} = 1 ]; then
     		echo "WARNING: Trying to recover filesystem from inconsistency..."
    -		/sbin/fsck -yF
    +		ntries=0
    +		fsck_rc=1
    +		until [ $ntries -ge 3 -o $fsck_rc -eq 0 ]; do
    +			/sbin/fsck -y
    +			fsck_rc=$?
    +			ntries=$((ntries+1))
    +			echo "DEBUG: Run #${ntries} - rc = ${fsck_rc}"
    +			sleep 1
    +
    +			# Sometimes first call returns 0 but filesystem is still broken
    +			# Run fsck in preen mode again just to be sure
    +			/sbin/fsck -p -F
    +			fsck_rc=$?
    +			echo "DEBUG: (-p) #${ntries} - rc = ${fsck_rc}"
    +			sleep 1
    +		done
    +
    +		if [ $fsck_rc -ne 0 ]; then
    +			echo "Automatic filesystem recovery failed. Starting recovery shell!"
    +			tcsh
    +			reboot
    +		fi
     	fi
    
     	/sbin/mount -a 2>/dev/null
    -	mount_rc=$?
    -	attempts=0
    -	while [ ${mount_rc} -ne 0 -a ${attempts} -lt 3 ]; do
    -		/sbin/fsck -yF
    -		/sbin/mount -a 2>/dev/null
    -		mount_rc=$?
    -		attempts=$((attempts+1))
    -	done
    +
    +	if [ $? -ne 0 ]; then
    +		echo "Filesystems could not be mounted. Starting recovery shell!"
    +		tcsh
    +		reboot
    +	fi
    
     	if [ "${PLATFORM}" = "nanobsd" ]; then
     		# XXX This script does need all filesystems rw!!!!
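
The retry idea in the patch above can be tried out in isolation. The following is only a minimal sketch, not the pfSense rc script: `retry_check` is a hypothetical helper name, and the limit of three attempts and the `/sbin/fsck -y` invocation are simply copied from the patch. It runs a command until it exits 0 or the attempt limit is reached, then returns the last exit status.

```sh
#!/bin/sh
# Hypothetical helper (not part of pfSense): run a command repeatedly
# until it exits 0 or max_tries attempts have been made.
# Returns the exit status of the last attempt.
retry_check() {
    max_tries=$1
    shift
    ntries=0
    rc=1
    while [ "$ntries" -lt "$max_tries" ] && [ "$rc" -ne 0 ]; do
        "$@"                      # run the check command as given
        rc=$?
        ntries=$((ntries + 1))
        echo "run #${ntries}: rc=${rc}"
    done
    return "$rc"
}

# Intended use, mirroring the patch (left commented out so the sketch
# can be sourced safely on a live system):
#   retry_check 3 /sbin/fsck -y || echo "Automatic filesystem recovery failed."
```

The context lines of the patch also show that the rc script tests for `/root/force_fsck` before running a forced check, so creating that file is presumably how a check at the next boot is requested. Whether repeated runs actually help is exactly what is disputed in this thread.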