Fault tolerance on return of power
-
@fg what UPS? Apcupsd for instance has an option “Hibernate UPS on powerfail” to turn off the UPS. Otherwise Arm devices don’t/can’t power off when Halted.
In general if power is lost on a 3100 it should be ok but ZFS (which the 3100 doesn’t have available) is better at that.
-
The 3100 will always try to boot when power is applied to it.
But, yes, since you can't use ZFS on the 3100 you should consider enabling RAM disks. I've yet to see a filesystem problem that prevented boot when using UFS with ramdisks.
Steve
-
Hey Stevo,
I did as you suggested. My 3100 was using 31% of its 2 GB of memory. The recommended RAM disk minimums were 40 MB for /tmp and 60 MB for /var, so I set 100 MB for /tmp and 150 MB for /var (keeping /var 50% larger). I set the backup intervals for DHCP, logs, RRD and captive portal data to 4 hours, and rebooted.
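For anyone checking the sizing by hand, here is a rough Python sketch of the arithmetic. The 2 GB total and the 40/60 and 100/150 MB sizes come from the post above; the function name is made up for illustration, and the sizes are treated as worst-case caps (both RAM disks completely full).

```python
def ramdisk_share(total_ram_mb, tmp_mb, var_mb):
    """Return the fraction of total RAM the two RAM disks could use if both filled up."""
    if total_ram_mb <= 0:
        raise ValueError("total_ram_mb must be positive")
    return (tmp_mb + var_mb) / total_ram_mb


if __name__ == "__main__":
    total = 2048  # assumption: "2 gigs" means 2048 MB
    for tmp, var in [(40, 60), (100, 150)]:
        share = ramdisk_share(total, tmp, var)
        print(f"/tmp={tmp} MB, /var={var} MB -> at most {share:.1%} of RAM")
```

Even the larger pair stays a small slice of a 2 GB box, so on paper the sizes alone should not starve the system.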
Everything came up roses except that the pfBlockerNG daemons hadn't restarted. I tried to add a Wake-on-LAN widget and the 3100 crashed like a bull in the bullring when the estoque is delivered. :0)
I had to putty in and turn back time.
It did come back up fine with all the daemons running. This has been my problem with pfSense for many years now. It is not a fault tolerant system. Between this example and any number of updates for pfSense CE and pfSense Plus, I've had to get back up to speed with the product and restore what was broken on-site. I don't think I can rely on offsite management. I'm not that good. Snort has always been an issue.
When you leave the system alone it runs a long time. And presumably well but since I don't touch it I'm of course presuming.
Thanks for your input. I'm going to try one more time and see what happens. But if you got any secrets I'd appreciate you sharing them with me.
-
Well, I tried 3 times more with the last try using the minimum 40 and 60 megs.
This looks like the culprit.
Something about an uncaught error in line 76 of interfaces.inc which comports with the experience on the matter.
-
Hmm, you simply enabled ramdisks and then rebooted and it failed to come back up each time?
Or is that a forced power cycle each time?
-
@fg I have an APC SCL500RM1UC UPS. A USB cable connects it to my 3100, and I have the Apcupsd package installed on the 3100. A HALT is issued to the 3100 when the UPS remaining runtime is 3 minutes.
My 3100 has always rebooted successfully after a power failure.
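As a rough illustration of the trigger described above, here is a small Python model. It is not apcupsd's code. As far as I know, apcupsd.conf has `BATTERYLEVEL`, `MINUTES` and `TIMEOUT` directives and a shutdown starts when any one of them is reached; the function name and the default values below are assumptions for the sketch.

```python
def should_shutdown(battery_pct, runtime_min, seconds_on_battery,
                    batterylevel=5, minutes=3, timeout=0):
    """Model of a 'first condition reached wins' shutdown rule.

    batterylevel, minutes and timeout mirror the apcupsd.conf directives
    of the same names; timeout=0 means the on-battery timer is disabled.
    """
    if battery_pct <= batterylevel:
        return True   # battery charge too low
    if runtime_min <= minutes:
        return True   # estimated runtime too short
    if timeout > 0 and seconds_on_battery >= timeout:
        return True   # been on battery too long
    return False


# 80% charge but only 3 minutes of runtime left: the runtime rule fires.
print(should_shutdown(battery_pct=80, runtime_min=3, seconds_on_battery=600))
```

Note this only decides when the halt is issued. Whether the UPS then cuts its output is a separate setting.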
-
@fg https://docs.netgate.com/pfsense/en/latest/config/advanced-misc.html#ram-disk-sizes
“The suggested sizes on the page are an absolute minimum and often much larger sizes are required.”
There’s a note on the setting page also. On a 3100 we use 128 and 512 as I recall. Of course it depends entirely on what is using it. Some pfB lists don’t fit in 1GB.
-
@stephenw10 It would come back up far enough that you could connect with the console cable to fix things, but the web GUI wouldn't work even after restarting it, and internet access went offline. That fits with the uncaught error at line 76 of interfaces.inc.
-
@fg Wait, so you're saying if you pull power you get that error at boot every time? That's not normal. We had a bunch of 3100s in the field and I'm sure some have lost power over the years.
What happens if you Diagnostics/Reboot?
Pulling power isn't great but it's not usually fatal unless there is file system corruption from that incident.
https://docs.netgate.com/pfsense/en/latest/troubleshooting/filesystem-check.html
-
This might be my solution.
I guess the power goes out. The Apcupsd package then monitors the UPS so that when three minutes are left it halts the 3100, but the 3100 stays powered on, right? Halt doesn't cut power, so the battery keeps draining until it is finally used up. In that state, when the power comes back on, does the UPS send a "wake-on-lan" signal through the USB cable so the 3100 comes back on? And what if the power comes back before the battery is completely used up, while the 3100 is still lit up in halt? Does the "wake on lan" via the USB cable still restart the 3100?
Thank you for your help on this.
-
@fg The “Hibernate UPS on powerfail” checkbox I mentioned above should turn off the UPS after the pfSense shutdown happens. That way power is cut to the 3100. Then when power returns the UPS turns on and the 3100 has power again.
I believe that defaults to unchecked.
WOL is for across the network. I don't think Netgate routers support WOL. (One can send a WOL to a MAC address from pfSense IIRC).
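For reference, a WOL "magic packet" is just 6 bytes of 0xFF followed by the target MAC address repeated 16 times, normally sent as a broadcast on the LAN. That is why it is a network feature and not something a UPS would send over USB. A minimal Python sketch, where the MAC address is a placeholder and UDP port 9 is only a common convention:

```python
import socket


def build_magic_packet(mac):
    """Build a Wake-on-LAN magic packet for a MAC like 'aa:bb:cc:dd:ee:ff'."""
    raw = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(raw) != 6:
        raise ValueError("MAC address must be 6 bytes")
    return b"\xff" * 6 + raw * 16


def send_magic_packet(mac, broadcast="255.255.255.255", port=9):
    """Send the packet as a UDP broadcast (port 9 is a common choice, not a requirement)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(build_magic_packet(mac), (broadcast, port))


if __name__ == "__main__":
    packet = build_magic_packet("00:11:22:33:44:55")  # placeholder MAC
    print(len(packet), "bytes")
```

The target NIC has to support WOL and still have standby power for this to do anything.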
-
@SteveITS I agree with you. I did go with 100 and 150 before going back down to 40 and 60. Testing to see the issue. Like I said last night my memory usage was 31% and right now 15%. I would have increased it to higher memory reservation pending results.
I wonder why the error that was reported was about line 76 in the interfaces.inc file. Do the interfaces cut off when too much memory is allotted?
-
@SteveITS Thanks. You've cleared it up very well for me. Take care if I don't come back to pick your brains some more.
-
@fg I don't know about the interface error you're seeing.
As of a few versions (years) ago pfSense uses tmpfs for RAM disks, so it no longer preallocates the RAM; it is allocated when files are written. We used 128 and 512 MB on 3100s without issue but YMMV of course. (Like I wrote above, at least one big pfBlocker list uses well over 1 GB.)
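Since the configured size is a cap and not a reservation, the useful number is how much of each RAM disk is actually in use. Here is a sketch in Python using the standard library's `shutil.disk_usage`. It is not part of pfSense, and it assumes you have somewhere to run Python; the two paths are simply the mount points discussed in this thread.

```python
import shutil


def percent_used(used, total):
    """Return usage as a percentage, or 0.0 for a zero-size filesystem."""
    return 100.0 * used / total if total else 0.0


def report(paths=("/tmp", "/var")):
    """Print size and usage for each path that exists."""
    for path in paths:
        try:
            usage = shutil.disk_usage(path)
        except OSError:
            print(f"{path}: not available")
            continue
        print(f"{path}: {usage.used / 2**20:.0f} MiB of "
              f"{usage.total / 2**20:.0f} MiB used "
              f"({percent_used(usage.used, usage.total):.0f}%)")


if __name__ == "__main__":
    report()
```

If /var sits near 100% after pfBlocker updates, the cap is too small regardless of how much RAM shows as free.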
-
@fg said in Fault tolerance on return of power:
It kind of goes with the uncaught error regarding line 76 in the file interfaces.inc.
But just to be clear you are seeing that error after simply rebooting? Or after forcibly power cycling it?
Because that should never happen by simply rebooting it.
And, yes, if a UPS does not remove power to the 3100 it will not boot up after being halted. It needs to be power cycled at that point.
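The two outcomes can be sketched as a toy state model in Python. Everything here is hypothetical (state and event names included); it only encodes the rule from this thread that the 3100 boots when power is applied and otherwise stays halted.

```python
def final_state(events):
    """Replay power events and return the router's final state.

    States: 'running', 'halted' (still powered), 'off' (no power).
    Events: 'halt', 'power_cut', 'power_applied'.
    """
    state = "running"
    for event in events:
        if event == "halt" and state == "running":
            state = "halted"
        elif event == "power_cut":
            state = "off"
        elif event == "power_applied" and state == "off":
            state = "running"  # the 3100 boots whenever power is applied
        # Any other combination leaves the state unchanged, e.g. mains
        # returning while the router is halted but never lost power.
    return state


# UPS hibernates after the halt, so the router sees a real power cycle.
print(final_state(["halt", "power_cut", "power_applied"]))
# Mains returns before the UPS output ever drops: no power cycle happens.
print(final_state(["halt"]))
```

The second case is the one to avoid, and it is what the "Hibernate UPS on powerfail" option is meant to prevent.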
-
Thanks Steve. I'm going to be taking the 3100 up North for that property and going back to community on an Atom 27xx board I built some time ago. More hardware capacity.
I like the 3100 but the 16 gig RAM atom will do me better short term until I get the 4xxx Netgate. ;-)
Thanks for all your help.
-
Well, to be clear... it was on rebooting that I got the error message. It has been explained to me that tmpfs(?) has taken over ramdisk duties. With only two gigs I'm more comfortable going back to a community appliance I built. Taking the 3100 to a cabin in the woods and letting time sort out what Netgate I might get. Thanks for all the help. Couldn't be more appreciated.
-