Failure to re-start properly, possibly since 21.05
-
I have a really weird issue that started last night at 12:07am, although it seems to be reboot-related, so the problem may have existed for some time, possibly since the last reboot, which was probably when 21.05 was installed about a week ago.
The issue is that when I reboot my SG-3100, many services are not running when it comes back up! Initially I can't get into the web interface until I SSH in and restart the web interface manually. Once I am in the web interface I can see that DHCP and Watchdog Daemon are not running (both start OK when I start them manually). But worse still, no traffic is flowing out to the DSL line or between Lan1 and the Opt1 port! It's as if routing has also stopped, although I don't know how to check whether that is the case.
In the end I restored a backup from a few months back and initially it worked, so I set about comparing the backup with the current config so I could manually re-apply changes (mainly minor firewall changes). But after a reboot, it stopped working again, and I was back to having to manually start the web interface etc.
I spent ages doing and undoing the firewall changes one at a time but in the end it seems to me that it has nothing to do with the actual config, just that it seems to 'mostly' work if I restore a backup for the first boot, but after another reboot it fails again!
I tried a factory reset, restored a backup (tried both current and the old one) and again it worked, until the next reboot.
How can I diagnose what is failing during reboots?
The other odd thing, which doesn't seem to make much sense, is that some restored settings (enabling SSH and the SSH admin password, but not the web interface one) seem to only apply after the second reboot. What I mean is I had to use a USB cable to enable SSH and then after restoring the backup the SSH remained enabled (even though it wasn't enabled in the backup) and I noticed the default SSH password persisted after the factory reset even after restoring backups. However, a reboot or so later, SSH disabled itself again and the password changed back to the one from the backup. This behaviour may be unrelated, but it isn't working logically and maybe that is why it fails only after a reboot.
-
The first thing to do is connect to the USB console and reboot it. Monitor the entire boot process and see what happens.
It sounds to me like some part of the boot process is getting hung up or it terminates early so it never finishes completely.
-
@jimp Sorry for the delay, but all had been working OK as long as I left it alone, and I hadn't had time to poke it with a stick (i.e. reboot it) unnecessarily. However, this morning I had to do a reboot and sure enough, it failed to start up again!
So I found a USB cable, connected to the console, and I think I found the source of the problem:
This is the last section of the log when it failed:
Configuring IPsec VTI interfaces...done.
Configuring CARP settings...done.
Syncing OpenVPN settings...Segmentation fault (core dumped)
Starting CRON... done.
Netgate pfSense Plus 21.05-RELEASE arm Tue Jun 01 16:52:45 EDT 2021
Bootup complete
So I disabled my OpenVPN servers and rebooted again and it worked:
Configuring IPsec VTI interfaces...done.
Configuring CARP settings...done.
Syncing OpenVPN settings...done.
Configuring firewall......done.
Starting PFLOG...done.
Setting up gateway monitors...done.
Setting up static routes...done.
Setting up DNSs...
Starting DNS Resolver...done.
Synchronizing user settings...done.
Starting webConfigurator...done.
Configuring CRON...done.
Starting NTP Server...done.
Starting DHCP service...done.
Configuring firewall......done.
Generating RRD graphs...done.
Starting watchdog daemon...done.
Triggering packages reinstallation in background
Starting syslog...done.
Starting CRON... done.
Netgate pfSense Plus 21.05-RELEASE arm Tue Jun 01 16:52:45 EDT 2021
Bootup complete
So, I guess the question is, what is a segmentation fault in OpenVPN (which works fine for the OpenVPN servers I have set up, although I am now having to manually re-enable them after boot)? And why would a fault in one specific service like this stop the rest of the boot sequence from happening?
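In case it helps anyone else comparing console captures like the two above, here is a rough Python sketch that flags boot steps which report something other than "done". It assumes the console output was saved with one step per line in the "description...status" form seen above; the function name find_failed_steps and the sample text are just for illustration and are not part of pfSense.

```python
def find_failed_steps(log_text):
    """Return (step, status) pairs for boot steps that did not report 'done'.

    Assumes one boot step per line in the form '<description>...<status>'.
    Lines without '...' and steps with no status text at all are skipped.
    """
    failed = []
    for line in log_text.splitlines():
        line = line.strip()
        if "..." not in line:
            continue  # banner lines, "Bootup complete", etc.
        # Split on the first run of dots; extra dots stay in the status part.
        step, _, status = line.partition("...")
        status = status.strip(" .")
        if status and status.lower() != "done":
            failed.append((step.strip(), status))
    return failed


# Sample taken from the failed boot shown above.
SAMPLE_LOG = """\
Configuring IPsec VTI interfaces...done.
Configuring CARP settings...done.
Syncing OpenVPN settings...Segmentation fault (core dumped)
Starting CRON... done.
Bootup complete
"""

for step, status in find_failed_steps(SAMPLE_LOG):
    print(f"{step}: {status}")
# prints: Syncing OpenVPN settings: Segmentation fault (core dumped)
```

This only catches steps that print an error in place of "done". A step that hangs without printing anything would need to be spotted by eye as the last line in the capture.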
-
It's almost certainly hitting a PHP crash there, which is a known issue with a workaround available.
-
@jimp Ahh, yes, thanks. I saw the Snort issue and thought it looked very similar, so I guess I have just found another cause of the same bug.
The temporary fix appears to be to run
ini_set("pcre.jit", "0");
But it doesn't say where or how to run this. I can connect via SSH or the web interface, so can I do this from either of those? And would it be as simple as changing the 0 to a 1 after the next release arrives to revert it?
Failing that, do we know how far off the next release is? I could always just disable OpenVPN before I do a restart (and hope the UPS sees us through any power cuts) until then.
-
The link to the workaround above has instructions on how to add a patch to make the fix persistent.
-
Just a quick update to say that I am pleased to confirm 21.05.1 fixes this issue.
-
Thanks for the follow up.