pfSense 2.2 503 - Service Not Available

Marv21

Hey guys

Since today we have the same issue. I shut down the firewall (no virtualisation, native installation with lagg) on Friday and started it on Sunday. After the first start I got into the HTTPS web page, had an issue with the apinger service, and restarted over SSH.

After the restart I can't connect to VPN, SSH, or the web page. Same issue as everyone here.

No SSH? No VPN? -> That's a different problem. This thread is only about people with no access to the webGUI; everything else works fine.

Overlord

Hey guys

I found the issue today. The system removed the group "wheel", and after that it could no longer create the php-fpm.pid and php-fpm.socket files.

Greetz

gessel

This is massively, massively bad. Like "blocker" hair on fire bad. One of the great things about pfSense has been the appliance-like reliability which permits confident headless operation and operation in challenging environments where power isn't reliable. It has always just come back from power failures.

Today our UPS went batshit. It happens a lot here (in Iraq) where the AC line voltage varies from 80-260V, 40-65Hz, and goes out about 6-8 times per day. A UPS just doesn't last long under that kind of abuse (and this is a tiny little logic supply fanless box running on a SmartUps 3000 rack-mount, so it should have at least a day's run-time, which it needs on generator service days).

Today I got the 503 Service Unavailable error after spending half the day replacing the UPS. I took the time to drag a monitor up to the data cabinet (sealed, air:air self-cooled) and saw the attached:

(searchable as)
[ERROR] [pool lighty] cannot get gid for group 'wheel'
[ERROR] FPM initialization failed

And I tried restarting several times, restarting the webconfigurator (11), and restarting PHP-FPM (16), and then found this thread and realized it was to no avail. Time to reconfigure the network to permit download of the current version (this one has been UI upgraded for the last 3 years) and hope the last config backup has the DHCP assignments for the 60 or so machines that were added recently.

How far back would one have to downgrade to escape the over-eager fsck?

Bluethunder's suggestions are good, but in my case, it is the UPS that is the problem.

Migrating to boot from ZFS would escape FSCK completely. I boot from ZFS on my FreeBSD servers already and it is quite reliable and fairly easy to configure now that it is integrated into the 10.1 installer.

It might also help as a stop-gap to specify fsck_y_enable="NO" in /etc/rc.conf. You'd hang at startup, but at least you could intervene in the FSCK process and possibly prevent the system from eating itself.

[attachment: 20150410_211807_pfSense_fucked.jpg]

lw9474

We have the same issue with one in the Bahamas. Power issues over the weekend and now not able to access the router. It is running but not passing any traffic. No DHCP and error 503 on the gui. Unfortunately SSH is turned off.

gessel

If you can get console (or have someone do it), it is pretty easy to pull the config off:

/cf/conf/backup

A remote KVM with virtual media adapters may be essential with newer versions of pfSense that are at risk until this is addressed (which may be never).

jimp

Until we figure out how to fix this (it's a FreeBSD/fsck issue) it might also be wise for those prone to multiple instances of it to keep a tarball of /etc somewhere… If it breaks then untar the file back over /etc, reboot, and keep going.
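
A minimal sketch of that tarball approach, assuming a POSIX shell and tar. The function names and the archive path /root/etc-backup.tgz are only illustrative, not something pfSense ships:

```shell
#!/bin/sh
# Sketch only: function names and the archive location are hypothetical.

# backup_etc ROOT ARCHIVE: archive ROOT/etc into ARCHIVE (gzip-compressed).
backup_etc() {
    root=$1
    archive=$2
    tar -czf "$archive" -C "$root" etc
}

# restore_etc ROOT ARCHIVE: unpack ARCHIVE back over ROOT/etc.
restore_etc() {
    root=$1
    archive=$2
    tar -xzf "$archive" -C "$root"
}

# On a real box (as root), while the system is still healthy:
#   backup_etc / /root/etc-backup.tgz
# After corruption, from the console, followed by a reboot:
#   restore_etc / /root/etc-backup.tgz
```

Since the archive would sit on the same filesystem that fsck is mangling, keeping a second copy off the box is probably wise.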

doktornotor

Is there any way to disable fsck altogether (without breaking non-interactive boot)? Apparently it does more harm than good.

jimp

No. If we disable the call to fsck it won't mount the slice and will drop to a console… and the fix is to run fsck. catch-22.

gessel

If it is possible to recover from a tarball, then perhaps a script could run on startup, test for some indication of this problem, and automatically recover /etc from the archive? An ugly hack, but the problem I could easily see for myself (pfSense instances running 20 hours of travel apart) is that the manual fix is not an easy talk-through for a non-technical hands-on person, and if the system goes down it is awfully hard to get in from the WAN side to do the work remotely.

I don't want to attempt this again unintentionally, but does SSH successfully start when this happens and are the rules that permit WAN side access working?

Otherwise a remote box pretty much necessitates a remote KVM on an accessible IP outside the firewall to give console access. Having a tarball of /etc/ squirreled away would save from reinstalling and I'll prepare my instances for the worst by doing that and making sure WAN side SSH works.
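
A rough sketch of what such a startup check might look like, assuming a POSIX shell. The only symptom it tests for is the missing "wheel" group reported earlier in the thread; the function names and paths are made up for illustration, and real corruption may well show up in other ways:

```shell
#!/bin/sh
# Sketch only: names and paths are hypothetical, and the single symptom
# checked is a missing 'wheel' entry in the group file.

# etc_looks_damaged ETC_DIR: return 0 if ETC_DIR/group is missing, empty,
# or has no 'wheel' entry; return 1 otherwise.
etc_looks_damaged() {
    group_file=$1/group
    [ -s "$group_file" ] || return 0
    grep -q '^wheel:' "$group_file" && return 1
    return 0
}

# restore_if_damaged ROOT ARCHIVE: unpack ARCHIVE over ROOT/etc only when
# the check above fires and the archive is readable. Prints what it did.
restore_if_damaged() {
    root=$1
    archive=$2
    if ! etc_looks_damaged "$root/etc"; then
        echo "ok"
        return 0
    fi
    if [ ! -r "$archive" ]; then
        echo "damaged, no archive"
        return 1
    fi
    tar -xzf "$archive" -C "$root" && echo "restored"
}

# Possible use early at boot (as root), followed by a reboot if it restored:
#   restore_if_damaged / /root/etc-backup.tgz
```

Whether anything run this early in boot would itself survive the corruption is an open question, so this would not replace console access.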

jimp

It may be possible to make an ugly hack like that, but it's not something we'd actually code up and put in the images (not that I can see happening anyhow) unless things got really desperate.

For those especially prone to this, you might also try adding "sync,noatime" (sans quotes) to the mount options for the disk in /etc/fstab – in my testing it still ran fsck and found errors but I didn't see any corruption. Though whether that was pure luck or due to the change is unclear yet. For example:

Before:

/dev/ufsid/552d6d027debc466		/		ufs	rw		1	1

After:

/dev/ufsid/552d6d027debc466		/		ufs	rw,sync,noatime		1	1

Disk performance may take a slight hit for that but if it does help, it's worth the extra stability.
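
For anyone making that change on several boxes, here is a small sketch that applies it with awk instead of by hand. The function name is made up for illustration, and it assumes the root filesystem is the UFS entry mounted at / as in the example above; it prints the modified file rather than editing in place, so the result can be reviewed first:

```shell
#!/bin/sh
# Sketch only: the function name is hypothetical. Prints FSTAB with "sync"
# and "noatime" added to the options of the UFS entry mounted at "/",
# skipping any option that is already present.
add_sync_noatime() {
    awk 'BEGIN { OFS = "\t" }
         $1 !~ /^#/ && $2 == "/" && $3 == "ufs" {
             # Pad with commas so "async" is not mistaken for "sync".
             if (index("," $4 ",", ",sync,") == 0)    $4 = $4 ",sync"
             if (index("," $4 ",", ",noatime,") == 0) $4 = $4 ",noatime"
         }
         { print }' "$1"
}

# Possible use (as root); review the new file before moving it into place:
#   cp /etc/fstab /etc/fstab.bak
#   add_sync_noatime /etc/fstab.bak > /etc/fstab.new
```

Lines it does not change are printed untouched; the changed line comes out with single tabs between fields, which fstab accepts.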

gessel

This seems like a sensible fix. It should help reduce the risk of corruption or data loss. The mitigations, as I see them:

Make sure SSH access works from wherever one needs to manage a dead firewall from (probably WAN)
Back up /etc to someplace sensible
Adjust /etc/fstab to trade performance for reliability

Hopefully this will get sorted.

I would think that moving to boot on ZFS would be a reasonable migration path. No more fsck.

jimp

ZFS is more of a long term goal (and it is one of our goals, definitely) – not something we can implement fast or without lots of testing, and not an option for upgrades. So it is great for the future, but not what we need to fix right now.

prairie-sky

I'm having this same issue at a remote site.

Does it just kill the GUI or does it kill the routing as well? I can still ping the box but I'm really hoping it's still allowing traffic to flow through…

yaplej

I just ran into this issue on two VM instances of pfSense I was setting up. I've been struggling to get CARP working between two KVM hosts and would reboot the hosts without shutting down the pfSense VM first (simulating power failure). I have now re-installed the pair 3 times. Trying the ",sync,noatime" option in /etc/fstab to see if it prevents needing to re-install.

gessel

yaplej: please report if it does help - seems like you're doing the right kind of testing to verify. I've made the changes on all my pfSense instances: fingers crossed power doesn't go out at a remote site and kill it.

donpfsform

I had this happen after a power outage. Version 2.2.0. I received a 503 error on the web console and tried enabling SSH from the console with no luck. Routing did work as long as I gave the workstation a static IP and a DNS other than pfSense. I ended up reinstalling to fix the issue. I installed several instances this time to different partitions so I can at least boot something. http://www.blog.unflap.com/2009/12/28/dual-boot-pfsense-for-testing-new-versions/ You can just choose to go back to the main menu instead of rebooting and install as many instances as you need.

Note: for some reason a clean install of 2.2.2 would not boot on my Dell Optiplex 320. I would get the F1, F2, F3 selection and then it would reboot in an endless loop. It had listed ad0 for the drive. I finally went back to 2.1.0 and installed multiple instances without issue to ad4. Not sure why the drive letter changed. All three instances updated to 2.2.2 and restored a config just fine.

KOM

I just had this happen after installing Bandwidthd on a new test 2.2.2 config that I spun up today. No power outage or anything else strange going on. I selected the LAN interface and clicked Save. Then I went to Access Bandwidthd and got:

Please start bandwidthd to populate this directory.

I went back to the Bandwidthd config page and again clicked Save, and that's when I got the 503. All WebGUI attempts now give 503 until I restart PHP-FPM. I have never, ever seen this before until now with bandwidthd.

With PHP-FPM restarted, everything appeared to be normal again. Then when I went to my dashboard, all the widgets were good except for the NTP widget, which showed 503, and my LAN traffic graph, which shows

Cannot get data about interface vmx1

ntopng seems to have died as well.

[attachment: pfsense.png]

jimp

That's likely a different root cause than the filesystem corruption others are seeing.

KOM

And the problem persists through a reboot. Now both squid and squidGuard, which were working fine, are crashing as fast as they can start. WebGUI gives the exact same display as my screencap, with 503 for the NTP widget and the LAN graph not able to talk to vmx1. Bizarre.

doktornotor

With the squid* censored, it's pretty much possible you are getting an unclean reboot – which is perfectly enough to screw the filesystem.

On that note, I wonder if anyone tested with fsck from some previous FBSD versions. The one in 10.1 is simply mad.