Terribly slow boot times and frequent boot freezes
-
Hey there! I run pfSense on a Topton mini computer with 4x 2.5Gb Ethernet from AliExpress, and it's been stable for quite a while, apart from some early issues with setup and interfaces. But I just left the house for a few days, and my power went out for longer than my UPS could handle. My router went offline, and it seems that when power was restored, the boot failed. Once I got home and could reboot it, I saw it hang most of the time on a series of errors, most frequently:
da0 at umass-sim0 bus 0 scbus1 target 0 lun 0
da0: <Generic STORAGE DEVICE 1404> Removable Direct Access SPC-4 SCSI device
da0: 40.000MB/s transfers
da0: Attempt to query device size failed: NOT READY, Medium not present
or on what looks like a SCSI/USB error. I'm running my boot/pfSense disk on an NVMe drive, which has passed its S.M.A.R.T. tests, and I've turned off USB drive booting in the BIOS.
Sometimes (after rebooting multiple times) I can get it past this point, but then it takes AGES to boot: something like 12-15 minutes before I can get to the web page. It hangs on 'Synchronizing user settings' and at several other points, such as 'Starting Web Configurator', and then on all the installed packages. I've gone back and removed as many unused packages as I can, and it seems to be running stably now, but clearly it won't reliably survive a reboot, and I'd like to understand what's causing the boot failure and how to avoid it in future.
-
You use a UPS. Is it connected by a serial or USB cable, and is pfSense set up to use the UPS?
If so, when the battery starts running low, pfSense will shut itself down cleanly,
exactly as if you shut it down from the GUI. Shutting down this way is pretty much mandatory.
You (or your power company) can't just cut the power to shut it down. Every device with a complex OS and file system (PC, phone, etc.) needs a controlled shutdown. Chances are your device won't boot again, because the file system (structure) got damaged. Connect to the console (HDMI or USB serial, not SSH!) and follow this:
How to Run a pfSense Software File System Check (5/2020).
That said, this message
da0: Attempt to query device size failed: NOT READY, Medium not present
might also indicate that the drive is nearing end of life.
The next time it does boot, export a backup of your pfSense settings!
Then get a screwdriver, open up the AliExpress box, locate the drive, and order a replacement: an identical one, or rather the same type but bigger, as they cost close to nothing these days. Don't worry, it's 2025, and drives still die all the time.
What is your pfSense version ?
-
@Gertjan pfSense version is 24.11. I am running a UPS, but it's not plugged into the router; it's plugged into my unRAID server, as that offers better management, and the router is just set to auto-start after a power failure.
I am willing to believe it could be a slowly failing/bad drive, but it's not showing any SMART issues and is barely 2 years old:
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 40 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 94%
Data Units Read: 3,947,408 [2.02 TB]
Data Units Written: 619,185,433 [317 TB]
Host Read Commands: 52,825,492
Host Write Commands: 7,555,637,804
Controller Busy Time: 31,838
Power Cycles: 64
Power On Hours: 19,611
Unsafe Shutdowns: 53
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 341
Critical Comp. Temperature Time: 0
Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged
Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
Num  Test_Description  Status                       Power_on_Hours  Failing_LBA  NSID  Seg  SCT  Code
 0   Short             Completed without error               19602            -     -    -    -     -
 1   Extended          Aborted: Self-test command            19602            -     -    -    -     -
 2   Extended          Completed without error               19602            -     -    -    -     -
-
@rmeskill said in Terribly slow boot times and frequent boot freezes:
rather into my unRAID server as it offers better management
In that case, I'm pretty sure you could use a pfSense UPS package and have it connect to another UPS server on your LAN: the unRAID UPS server.
The pfSense UPS software then acts as a client with indirect access to the state of the UPS attached to the unRAID server, so it can power down cleanly when needed. I do insist on some UPS protection. A sudden power loss on pfSense isn't the end of the world: in a worst-case scenario, you reinstall pfSense clean with the installer from a USB drive, import the config, and you're back online.
On the other hand, it's always nasty when your main Internet access goes down...
@rmeskill said in Terribly slow boot times and frequent boot freezes:
but it's not showing any SMART issues and is barely 2 years old
Power On Hours: 19,611 / 24 hours / 365 days = 2.2 years old.
This should not show up several seconds after boot:
@rmeskill said in Terribly slow boot times and frequent boot freezes:
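The age estimate is plain arithmetic on the SMART counter. A minimal Python sketch of the same calculation, assuming the drive was powered on continuously (a router usually is) and using 365-day years:

```python
# Convert the SMART "Power On Hours" counter into days and years.
# Assumption: continuous power-on and 365-day years, as in the estimate above.
POWER_ON_HOURS = 19_611  # value from the SMART output in this thread


def power_on_years(hours: int) -> float:
    """Return powered-on time in years (hours / 24 / 365)."""
    return hours / 24 / 365


days = POWER_ON_HOURS / 24
years = power_on_years(POWER_ON_HOURS)
print(f"{days:.0f} days = {years:.1f} years")
```

This prints "817 days = 2.2 years", so "barely 2 years old" and "2.2 years of power-on time" describe the same thing.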
da0 at umass-sim0 bus 0 scbus1 target 0 lun 0
da0: <Generic STORAGE DEVICE 1404> Removable Direct Access SPC-4 SCSI device
da0: 40.000MB/s transfers
da0: Attempt to query device size failed: NOT READY, Medium not present
The drive/device (this is the boot drive, right, not some other drive?) is detected alright.
But when the OS asks it a question, it doesn't answer with "wait a moment, not ready yet", but with the, IMHO, scarier "NOT READY, Medium not present". AFAIK, the 'medium' can't be removed ^^
I see this when I boot mine:
nvme0: Allocated 16MB host memory buffer
mmcsd0: 16GB <MMCHC TB2916 9.0 SN 51891D3E MFG 11/2021 by 112 0x0000> at mmc0 50.0MHz/8bit/65535-block
mmcsd0boot0: 4MB partition 1 at mmcsd0
mmcsd0boot1: 4MB partition 2 at mmcsd0
mmcsd0rpmb: 4MB partition 3 at mmcsd0
Trying to mount root from zfs:pfSense/ROOT/24.11-Relase []...
Root mount waiting for: CAM
Root mount waiting for: CAM
Root mount waiting for: CAM
nda0 at nvme0 bus 0 scbus0 target 0 lun 1
nda0: <M.2 (P80) 3TE6 V20B09 YCA12111250120759>
nda0: Serial Number YCA12111250120759
nda0: nvme version 1.3
nda0: 114473MB (234441648 512 byte sectors)
Note: I see a 16 GB eMMC drive (mmcsd0), but I don't use it, as my 4100 MAX has a 114 GB SSD (nda0), which handles writes much better over time.
Are you using the ZFS file system?
@rmeskill said in Terribly slow boot times and frequent boot freezes:
Data Units Written: 619,185,433 [317 TB]
Oh... 317 TB!
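For context on that number: the NVMe specification reports Data Units Written in thousands of 512-byte units, so one unit is 512,000 bytes, which is how smartctl arrives at its bracketed [317 TB]. The same SMART output also shows Percentage Used: 94%, which the spec defines as a vendor-specific estimate of how much of the drive's rated life has been consumed. A minimal sketch that reproduces the total and averages it over the power-on hours (assuming writes were spread evenly, which is only an approximation):

```python
# Convert NVMe "Data Units Written" into bytes and an average write rate.
# Per the NVMe spec, one data unit = 1000 x 512 bytes = 512,000 bytes.
BYTES_PER_DATA_UNIT = 1000 * 512

DATA_UNITS_WRITTEN = 619_185_433  # from the SMART output in this thread
POWER_ON_HOURS = 19_611           # from the SMART output in this thread


def data_units_to_bytes(units: int) -> int:
    """Return the number of bytes represented by NVMe data units."""
    return units * BYTES_PER_DATA_UNIT


written = data_units_to_bytes(DATA_UNITS_WRITTEN)
tb_written = written / 1e12                       # decimal TB, as smartctl prints
gb_per_day = written / POWER_ON_HOURS * 24 / 1e9  # assumes evenly spread writes

print(f"{tb_written:.0f} TB written, about {gb_per_day:.0f} GB/day")
```

This prints "317 TB written, about 388 GB/day". Whether that is "a lot" depends on the drive's rated endurance (TBW) from its datasheet, which isn't given in this thread.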
-
@Gertjan Well, this just raises a bunch of questions for me:
- Is ZFS bad? I didn't explicitly choose it; it seems it was just installed that way.
- Is running NVMe bad? My Topton box is fairly small and doesn't have internal room for anything other than an NVMe drive, I think? I might be able to put an SSD in instead, if that would be better?
- Is 317 TB a lot? If so, I don't know what's writing so much...
- Do we think there's a chance this drive is failing or got corrupted? I can look into the UPS setup, but the router clearly went down without UPS protection, so something could have broken somehow...
-
@rmeskill said in Terribly slow boot times and frequent boot freezes:
Is ZFS bad? I didn't explicitly choose it, it just seems it was installed that way
Nope, on the contrary. It handles our 'new' disks, which aren't spinning platters but 'sophisticated silicon gates' (SSD, NVMe, etc.), much better.
Still, and this is me rambling again: don't think hardware or software will protect you against power failures.
Power failures == bad.
If this:
da0: Attempt to query device size failed: NOT READY, Medium not present
wasn't caused by the power failure, you have another issue. Most probably: the drive is not OK.
Get a new drive, and I'm pretty sure your issue "Terribly slow boot times and frequent boot freezes" will be gone.
-
@Gertjan said in Terribly slow boot times and frequent boot freezes:
If this :
da0: Attempt to query device size failed: NOT READY, Medium not present
wasn't caused by the power failure
This error isn't directly from the power failure: it comes up on every boot now. But yeah, I don't know whether it's a failure of the drive or of another hardware interface. I do, however, have 2x NVMe slots, and when I moved the drive, the same error still came up, so that should rule out the physical NVMe slot. I've opened a case with Kingston to see if they'll honor an RMA.
-
That error from da0 is probably unrelated. It's not the NVMe drive. I'd guess that device has an SD card slot or something similar. It has no card in it, so it reports that media error.
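The reasoning can be read straight off the attach lines in the boot messages: each CAM device line says which controller it attached to. A minimal Python sketch with a hypothetical helper (not a pfSense or FreeBSD tool); the sample lines are adapted from the messages posted in this thread, with the garbled bus/scbus/target numbers assumed, and the labels are a rough assumption covering only the two cases seen here:

```python
import re

# Matches FreeBSD CAM attach lines such as
#   "da0 at umass-sim0 bus 0 scbus1 target 0 lun 0"
#   "nda0 at nvme0 bus 0 scbus0 target 0 lun 1"
ATTACH_RE = re.compile(r"^(\w+) at (\S+) bus \d+ scbus\d+")


def parents(boot_lines):
    """Map each CAM device (da0, nda0, ...) to the controller/SIM it attached to."""
    result = {}
    for line in boot_lines:
        m = ATTACH_RE.match(line.strip())
        if m:
            result[m.group(1)] = m.group(2)
    return result


def bus_kind(parent):
    """Rough, assumption-based label for the attachment point."""
    if parent.startswith("umass-sim"):
        return "USB mass storage (e.g. a card reader)"
    if parent.startswith("nvme"):
        return "NVMe controller"
    return "other"


boot_log = [
    "da0 at umass-sim0 bus 0 scbus1 target 0 lun 0",
    "da0: Attempt to query device size failed: NOT READY, Medium not present",
    "nda0 at nvme0 bus 0 scbus0 target 0 lun 1",
    "nda0: 114473MB (234441648 512 byte sectors)",
]

for dev, parent in parents(boot_log).items():
    print(f"{dev}: attached at {parent} -> {bus_kind(parent)}")
```

Here da0 comes out as attached at umass-sim0 (USB mass storage), while the NVMe disk shows up as nda0 on nvme0, so the "Medium not present" message belongs to a different device than the boot drive.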
-
@stephenw10 If so, any ideas why my boots are taking 20-30 minutes? And is there some way to test/confirm an issue with the NVMe drive?
-
Where in the boot is it stalling?
Try pressing ctrl+t when it's stalled. That should show you which process it's waiting on.
You could also try booting verbose. Interrupt the boot at the loader menu to reach the loader prompt (OK>) and enter:
boot -v
That may give you additional details about what it's doing before the delay.
-
Hmm, so in both of those situations it does eventually boot?
We've seen some other devices hit those, but AFAIK they never boot from there.
Try booting verbose to get more output from the 2nd scenario.
-
@stephenw10 That's actually a good question, and I think not: sometimes it freezes there and sometimes it boots. But I just had another power cut and ended up with this screen; it looks pretty damning for the NVMe:
Anyone have any suggestions for a good value/quality NVMe replacement?
-
Urgh, yeah, that's not good. It's difficult to break ZFS just by removing the power, so yes, it could be a bad drive.