Another Netgate with storage failure, 6 in total so far
-
@andrew_cb I installed a KingSpec NVMe 256GB drive from Amazon this morning. I followed the guidance you provided and the procedure to wipe the MMC drive in the documents.
It's back up and running now. The only difference seems to be that I no longer have the slowly flashing diamond that shows "boot complete/ready". I do get the flashing square during boot, and the orange circle when I put it in standby. I checked that none of the "light pipes" were bent, so that is a head-scratcher. Now when it's running, all the LEDs are dark. I also removed the RAM disk I had set up to try to prolong the life of the MMC drive, because it filled up during my reinstall and package reload.
Best I can tell it's all good except for the LED issue.
-
Hmm, odd. Try running:
pfSense-led.sh ready
-
@stephenw10 I'm running that command from where? And what output am I expecting?
-
@dstaylor said in Another Netgate with storage failure, 6 in total so far:
@stephenw10 I'm running that command from where? And what output am I expecting?
I found it down in /usr/local/sbin. I ran the command but there is no change.
-
Hmm, so just no LEDs? You can try setting any of the other states:
pfSense-led.sh
usage:
pfSense-LED booting
pfSense-LED ready
pfSense-LED update [1|0]
pfSense-LED updating
It should update the LEDs accordingly. 'ready' is the normal state after booting.
-
@stephenw10 I pulled the box apart again. It was a keyboard/chair interface issue; the "flap" around the lighthood had folded over a little bit and was blocking the LED.
Now the LEDs are working as expected.
-
Please do follow the advice to zero out your eMMC when installing an SSD. I did not, and it worked fine until it didn't. (I'm not sure that advice was even there at the time. I don't recall where I got the pointers to install my own SSD in the first place, but other than the special requirements on the type of SSD supported, I don't remember it being a particularly difficult install physically.) I appreciate the help of staff here in guiding me out of that predicament.
All of this discussion does make me wonder where I put my original pfsense media. It would seem to be a really bad idea not to have a recovery image in case it's needed (gulp).
-
@dnavas said in Another Netgate with storage failure, 6 in total so far:
bad idea not to have a recovery image
You can get the Netgate Installer from the store (free). It will download the latest version when you run it. Actually, I think they added, or were talking about adding, a version selection.
https://docs.netgate.com/pfsense/en/latest/install/index.html
-
@marcosm I am glad Netgate is committed to addressing the issues raised in this thread. When do you think Netgate will be able to share a high-level overview of the planned changes so the community can give feedback? It could be a new thread to help keep the discussion focused.
Feel free to contact me publicly or privately if there is anything I can assist with.
-
Add me to the list. I have a number of 4100 base models installed in a different time zone that, had I known, would have been upgraded at purchase a couple of years ago. The first of them just died on me. Fortunately I had a replacement on my shelf that I overnighted with a backup config. However, the remainder, which were similarly configured, are now likely at high mortality risk, which led me to this thread for options.
For reference, booting the dead 4100 and attempting to install a freshly downloaded copy from the store finds no valid storage devices.
If Netgate sold an appropriately overpriced B-key NVMe drive for the 4100 on the store, I would buy them just to sleep comfortably again.
In the meantime I can only turn off the packages I naively installed and cross my fingers.
-
You can also enable ram disks to reduce drive writes significantly.
-
@stephenw10 Thanks for the reminder. Back in the day, RAM disks were the norm on my Alix boxes (and, though I may be deluding myself, I recall them being the default config too). I got lazy, assuming these boxes would be preconfigured for longevity.
-
@stephenw10
The 4100 has only 4GB of RAM? In some use cases, such a small amount of memory is already barely enough for the device to function. And if you add a RAM disk... You'll have to significantly cut down on resource usage and configure it so that there are very few writes in general.
My hardware is not Netgate; my SSD is a Samsung Pro SATA 256GB and I have 16GB of RAM.
Over almost five years, I have accumulated about 32TB of writes (~20GB per day). Recently, I started using RAM disks, reducing writes to around 1GB or less per day. Enabling RAM disks was not straightforward—it required a lot of trial and error.
In the end, I optimized log writing and analyzed the temp folder to understand what was writing to the disk and why. However, even now, /tmp is set to 1024, and /var to 8192 due to stability concerns.
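The daily write figure quoted above can be sanity-checked with a little arithmetic. The following is only a rough sketch: the function name `daily_writes_gb` is hypothetical, decimal units (1 TB = 1000 GB) and a 365-day year are assumed, and the exact number of years is a guess, since the post only says "almost five".

```shell
#!/bin/sh
# Hypothetical helper: average daily writes in GB, given total writes in TB
# and the time span in years. Assumes 1 TB = 1000 GB and 365 days per year.
daily_writes_gb() {
    awk -v tb="$1" -v yrs="$2" 'BEGIN { printf "%.1f\n", tb * 1000 / (yrs * 365) }'
}

# Example: 32 TB written over a full five years.
daily_writes_gb 32 5
```

With a full five years this prints 17.5; a slightly shorter span, such as 4.5 years, gives a figure closer to the quoted ~20GB per day.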
I think the main problem is that eMMC wear can easily go unnoticed. And perhaps there should have been a preset in the Plus version that, where possible, checks the wear status and notifies the user well in advance of a critical state in every possible noticeable way, through all kinds of alerts. I’m not sure if it’s already too late for this or not...
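As a rough illustration of what such a wear check could look like: eMMC devices report a life-time estimate in their extended CSD register as a value from 0x01 to 0x0B, where each step stands for another 10% of rated life used and 0x0B means the estimate has been exceeded. The sketch below only translates such a value into a readable range. The function name `life_range` is hypothetical, and the command shown after the block for reading the register assumes the `mmc` utility is installed and that the device node is `/dev/mmcsd0rpmb`, which may differ on a given unit.

```shell
#!/bin/sh
# Hypothetical helper: translate an eMMC life-time estimate value
# (0x01..0x0B, as reported in the extended CSD) into a wear range.
life_range() {
    n=$(( $1 ))   # shell arithmetic accepts hex such as 0x0A
    if [ "$n" -ge 1 ] && [ "$n" -le 10 ]; then
        echo "$(( (n - 1) * 10 ))-$(( n * 10 ))% of rated life used"
    elif [ "$n" -eq 11 ]; then
        echo "rated life exceeded"
    else
        echo "undefined value"
    fi
}

# Examples with made-up readings, not taken from a real device.
life_range 0x01
life_range 0x0B
```

On the device itself, the raw values might be read with something like `mmc extcsd read /dev/mmcsd0rpmb | egrep 'LIFE|EOL'` (an assumption about tooling and device path, not confirmed anywhere in this thread), and the reported hex value then passed to `life_range`.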
-
@w0w said in Another Netgate with storage failure, 6 in total so far:
4100 have only 4GB of RAM? In some use cases with such a small amount of memory, it's already barely enough for the device to function. And if you add a RAM disk... You'll have to significantly cut down on resource usage and configure it so that there are very few writes in general.
Well, that depends on quite a lot of things. My edge device here is a 3100. That has 2GB RAM and I run RAM disks on that at 80/160MB without issue. That's running pfBlocker and Snort.
But obviously I didn't just enable all the lists and signatures.
-
@w0w Recent versions of pfSense don’t allocate the RAM disk space until it’s used, so it’s more flexible.
-
@stephenw10 Curious: I have an 1100.
I've been running into OOM situations with pfBlocker, and some services like SNMP are crashing.
I am somewhat concerned about the writes, but if I move to a RAM disk, do my issues go away?
-
@michmoor A RAM disk seems unlikely to help a low memory problem. You could limit writes as noted above though.
-
Fortunately, all of my 4100s are only using a small portion of their RAM (especially since I can't really use any additional packages that cause eMMC wear, now that I know better), whereas they exhausted 100% of their eMMC writes in just a couple of years. So the RAM disk is a helpful stopgap until I am able to physically access the devices again.
Unfortunately, it looks like the reboot necessary to engage the ram disk probably took out another one today!
-
@stephenw10 said in Another Netgate with storage failure, 6 in total so far:
My edge device here is a 3100
This one does not use ZFS, does it?
And I also noticed that pfSense very often displays the actual size of /tmp and /var incorrectly.
@SteveITS said in Another Netgate with storage failure, 6 in total so far:
Recent versions of pfSense don’t allocate the RAM disk space until it’s used, so it’s more flexible.
Yep, but for some reason (like a huge syslog file), I have run out of space several times.
-
And I want to repeat once again: the problem is not whether the RAM disk is enabled, whether to enable it, or how to do it. The problem is that disk wear goes unnoticed by the user, and they only start paying attention when the device has already died or is in a critical "almost dead" state.
So maybe, I don’t know, it's worth updating the documentation and, through some kind of newsletter, news post, or blog, recommending that users perform checks and follow the recommendations in the updated documentation?