Pfsense 2.5 stacks at boot with dots
-
@netblues same here. There was no power loss. One minute the system was running fine and the next minute, the first sign of trouble was the spinning wheel on the TV YouTube stream.
Ran fsck at least three times with no satisfaction.
Under 2.4 I'd had numerous power outages and brown outs and the system kept coming back every time without a hitch.
-
@gertjan said in Pfsense 2.5 stacks at boot with dots:
If there was a power loss, the file system can get hosed.
In my experience, this is only conditionally true. There are certain times where it's ok to kill the power to the device and certain times where it tends to cause trouble. Instances where I've observed this directly include:
-
Instructing an end-user on site to power cycle the device using the rocker switch. In this particular case, I was on the phone with the user while they were trying to locate the switch. During the call, they flipped the switch once but weren't sure whether they had done the right thing until after repeating to me what they did. When I said "yes, that's the correct switch," they flipped the switch again, interrupting the boot process. The device needed to be recovered in single-user mode by running
fsck -fy /
about 5 times.
-
Powering the device off via the rocker switch within about a minute of changing the LAN IP address via console. It seems something was still happening in the background for a little while after this, and powering it down almost immediately after logging out (back to the console menu) appears to have zeroed out /cf/conf/config.xml and/or created a zero-byte config file in /cf/conf/backup. This was the cause of the problem that led me to this discussion thread.
But! If the device has been running and no recent config changes have been made, it seems to be totally, 100% OK to kill the power unexpectedly. In my organization, we always deploy a UPS with each of our devices, but depending on the duration of a power outage, the UPS will run out of juice and everything will go down. I have yet to see a device not come back up after a power event of that nature.
DISCLAIMER: We have only ever deployed the XG71001U (except for one or two repurposed Dell servers we currently have or had in production) so everything I've said above only applies to those machines.
-
-
@sjm said in Pfsense 2.5 stacks at boot with dots:
@netblues same here. There was no power loss. One minute the system was running fine and the next minute, the first sign of trouble was the spinning wheel on the TV YouTube stream.
Ran fsck at least three times with no satisfaction.
Under 2.4 I'd had numerous power outages and brown outs and the system kept coming back every time without a hitch.
Same here...
It IS very dangerous to be left in the cold like that in a production environment.
And since it isn't just me, and it isn't only under KVM, it has to be something inside the pfSense code and/or BSD.
For the time being, we need to find a way to reproduce the issue so a proper bug report can be filed.
-
@gertjan said in Pfsense 2.5 stacks at boot with dots:
A zero bytes config file like config-xxxxxxxxxx.xml can't exist.
This is 100%, demonstrably untrue. This was the root of my problem yesterday; however, looking back on yesterday's terminal transcript, I realize why the device was trying to load the most recent backup:
It seems the /cf/conf/config.xml file was also empty. The errors below complain that the file doesn't exist, but shortly after seeing those and dropping to a shell, I confirmed that the file definitely existed and, recalling from memory (since I neglected to capture the ls -l output showing this), the file was also zero bytes. I didn't pay much attention to this at the time and instead focused on the last bit of output, where a backup config was being loaded. I didn't delete config.xml and instead opted to delete the only empty backup config, under the assumption that the next newest one would be loaded in its absence (which turned out to be exactly the case).
So it would appear that an empty config.xml won't cause this problem by itself; the system will boot just fine if you have a good backup config in /cf/conf/backup.
Here's some console output that I believe supports what I said above:
2019-01-15T22:13:03.669685-05:00 php-fpm 905 - - /ecl.php: No config.xml found, attempting last known config restore.
2019-01-15T22:13:03.670131-05:00 php-fpm 905 - - /ecl.php: New alert found: No config.xml found, attempting last known config restore.
2019-01-15T22:13:03.672266-05:00 php-fpm 905 - - /ecl.php: XML error: no pfsense object found!
2019-01-15T22:13:03.672335-05:00 php-fpm 905 - -
2019-01-15T22:13:03.672600-05:00 php-fpm 905 - - /ecl.php: Netgate pfSense Plus is restoring the configuration /conf/backup/config-1547535734.xml
2019-01-15T22:13:03.672810-05:00 php-fpm 905 - - /ecl.php: New alert found: Netgate pfSense Plus is restoring the configuration /conf/backup/config-1547535734.xml
2019-01-15T22:13:03.674754-05:00 php-fpm 905 - - /ecl.php: XML error: no pfsense or pfsense object found!
2019-01-15T22:13:03.674927-05:00 php-fpm 905 - -
2019-01-15T22:13:03.675529-05:00 php-fpm 905 - - /ecl.php: XML error: no pfsense object found!
2019-01-15T22:13:03.675593-05:00 php-fpm 905 - -
2019-01-15T22:13:03.675805-05:00 php-fpm 905 - - /ecl.php: Netgate pfSense Plus is restoring the configuration /cf/conf/backup/config-1547535734.xml
2019-01-15T22:13:03.676054-05:00 php-fpm 905 - - /ecl.php: New alert found: Netgate pfSense Plus is restoring the configuration /cf/conf/backup/config-1547535734.xml
Compare the filename of backup config file above, to the one I deleted in my post ... it's the same file.
Bottom line: Something happened to cause both of these to be zero-bytes and it brought the boot process to a dead stop.
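A quick way to check for the condition described here is to list any zero-length XML files under the config directory. This is a minimal sketch, not something taken from pfSense itself: the function name find_empty_configs is made up for illustration, and the default path /cf/conf is an assumption based on the paths in the log output above.

```sh
#!/bin/sh
# Hypothetical helper: list zero-byte XML files under a config directory.
# Assumption: the live config and its backups sit under /cf/conf
# (config.xml and backup/config-*.xml), as the log paths above suggest.
find_empty_configs() {
    dir="${1:-/cf/conf}"
    # -type f  : regular files only
    # -size 0  : zero-length files
    find "$dir" -type f -name '*.xml' -size 0
}
```

Run from a shell on the firewall, find_empty_configs /cf/conf prints nothing when every config file has content, and one path per line otherwise.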
Also interesting was the message that gets logged directly before it starts printing dots to the console:
2019-01-15T22:19:38.264066-05:00 php-fpm 1301 - - /ecl.php: New alert found: Netgate pfSense Plus is restoring the configuration /cf/conf/backup/config-1547535734.xml
External config loader 1.0 is now starting... mmcsd0s1 mmcsd0s1a mmcsd0s1b
Launching the init system...Updating CPU Microcode...
CPU: Intel(R) Atom(TM) CPU C3558 @ 2.20GHz (2200.07-MHz K8-class CPU)
Origin="GenuineIntel" Id=0x506f1 Family=0x6 Model=0x5f Stepping=1
Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
Features2=0x4ff8ebbf<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,SDBG,CX16,xTPR,PDCM,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,RDRAND>
AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
AMD Features2=0x101<LAHF,Prefetch>
Structured Extended Features=0x2294e283<FSGSBASE,TSCADJ,SMEP,ERMS,NFPUSG,MPX,PQE,RDSEED,SMAP,CLFLUSHOPT,PROCTRACE,SHA>
Structured Extended Features3=0xac000400<MD_CLEAR,IBPB,STIBP,ARCH_CAP,SSBD>
XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>
IA32_ARCH_CAPS=0x69<RDCL_NO,SKIP_L1DFL_VME>
VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID,VID,PostIntr
TSC: P-state invariant, performance statistics
Done.
.... done.
Initializing.................. done.
Starting device manager (devd)...devd: Can't open devctl device /dev/devctl: Device busy
done.
Loading configuration......done.
Updating configuration.......................................................................................................^C.
2019-01-15T22:19:41.788360-05:00 init 1 - - /bin/sh on /etc/rc terminated abnormally, going to single user mode
"Updating configuration" ... followed by a gazillion dots (until you see the process interrupted by ^C). I don't really know what's being updated here or how empty config files play into this process, but it's curious indeed.
-
@sjm FWIW this occurred on a device running 21.02.2-RELEASE (amd64). It might only be something that happens with 2.5 and later. Given how many of these machines I have in production and how easily this issue occurred (for me), I'd say it's a bug in the newer versions of pfSense.
-
@chamilton_ccn Do note that this isn't happening while booting, or changing the config.
Which makes it even more difficult to understand how files become 0 length.
Unless of course there is something automated running in the background.
-
@netblues said in Pfsense 2.5 stacks at boot with dots:
@chamilton_ccn Do note that this isn't happening while booting, or changing the config.
Which makes it even more difficult to understand how files become 0 length.
Unless of course there is something automated running in the background.
Yeah, that is strange for it to happen while the system is running. It's also something I haven't seen yet, which makes me a bit nervous. Just curious, did you actually find zero length config files in your situation? It's possible this issue has multiple causes; it just so happened that my situation was due to empty configs. Yours might be due to something else.
-
@sjm said in Pfsense 2.5 stacks at boot with dots:
Under 2.4 I'd had numerous power outages and brown outs and the system kept coming back every time without a hitch.
Same. The only time I've had a device not come back up is when there was a clear explanation as to why.
-
@chamilton_ccn I do remember seeing a zero length config.xml file.
-
@sjm said in Pfsense 2.5 stacks at boot with dots:
@chamilton_ccn I do remember seeing a zero length config.xml file.
Interesting! Do you know if the /cf/conf/backup directory was empty and/or whether the most recent backup config was empty? If you're uncertain and this happens again, definitely check that and report back!
-
@chamilton_ccn said in Pfsense 2.5 stacks at boot with dots:
It seems the /cf/conf/config.xml file was also empty.
I should repeat myself : impossible.
But I acknowledge : you saw it ... so it's possible.
I never saw such a thing for the past 10 years or so.
It's not pfSense "2.5", as that version doesn't exist. Is that 2.5.0? 2.5.1? Or 21.02.2-RELEASE (amd64)? All of them?
Meanwhile, some real info is shown now: there are logs!
So now it's known where the issue is: in /etc/inc/config.lib.inc, the while loop at line 465 keeps looping, printing a dot "." each time.
I suppose not only the backup XML files are 'gone', but also the main /cf/conf/config.xml.
So the code 'thinks' the config needs to be "upgraded", as it retrieves "0" as the initial version number ... and 0 + something changed by the config upgrade conversion becomes "B.S.".
That should be changed. The system should halt / bail out with a message like: "No valid config found - See you."
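The bail-out behaviour suggested above could look roughly like the following. This is only a sketch of the idea in shell, not pfSense's actual PHP logic: config_looks_valid and pick_restorable_config are hypothetical names, and the "valid" test is a crude text match for a pfsense root element (the thing the "no pfsense object found" error complains about), not real XML parsing.

```sh
#!/bin/sh
# Hypothetical check: the file is non-empty and mentions a <pfsense element.
# This is a rough text test, not XML validation.
config_looks_valid() {
    [ -s "$1" ] && grep -q '<pfsense' "$1"
}

# Hypothetical selector: print the newest backup that passes the check,
# or fail loudly instead of continuing with an empty config.
pick_restorable_config() {
    dir="$1"
    newest=""
    # Backup names embed a timestamp (config-<epoch>.xml) and the glob
    # expands in ascending order, so the last valid match is the newest.
    for f in "$dir"/config-*.xml; do
        if config_looks_valid "$f"; then
            newest="$f"
        fi
    done
    if [ -z "$newest" ]; then
        echo "No valid config found in $dir" >&2
        return 1
    fi
    echo "$newest"
}
```

With this shape, a directory holding only zero-byte backups produces an error and a non-zero exit status rather than an endless run of dots.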
Now it takes the second-to-last config ... which is zero, so it goes back to the one before that, which is also zero (right?).
Normally, when a file gets created, there is content to be written to it.
The creation process worked - but nothing gets written. Even a simple file copy doesn't work on your system.
Ok, so it's not a file system error.
I'll take the next best : PHP is brain dead ? The kernel is brain dead ? What's so special with your 'setup' that it is so messed up ?
Please keep going with the info.
Take your pfSense to a new VM, or even better, a dedicated machine, and see that it behaves like thousands of others: it can create files, fill them with info, etc.
If /cf/conf/config.xml creation was an issue for many others, this would be a major show stopper.
So, use another VM. For example, Windows has Hyper-V built in. I'm using two of them: works great.
-
@gertjan It WAS pfsense 2.5.0 stable when it happened.
And is also confirmed to happen on official netgate hardware.
The same vm host runs happily various other workloads including pf 2.5.1 as we speak.
And no, this has nothing to do with the Hypervisor.
And Hyper-V isn't going to solve it anyway.
-
@gertjan said in Pfsense 2.5 stacks at boot with dots:
But I acknowledge : you saw it ... so it's possible.
I never saw such a thing for the past 10 years or so.
Oh, it definitely happened, but I appreciate your skepticism :-) This is a first for me as well.
EDIT: In my situation, it was version 21.02.2-RELEASE (amd64).
-
@netblues
I suggested Hyper-V as an alternative because I saw "pfsense 2.5 under centos8 kvm" in the beginning.
Also because Netgate (I'm speaking for myself) did not test CentOS 8. Hyper-V was tested.
I'm not saying it's better. Just to get you on "common ground".
Btw:
[2.5.1-RELEASE][root@pfsense.outside.bdx.net.net]/cf/conf/backup: ls -al
total 28788
drwxr-xr-x 2 root wheel 2048 May 5 08:17 .
drwxr-xr-x 4 root wheel 2048 May 5 13:01 ..
-rw-r--r-- 1 root wheel 10999 May 5 08:17 backup.cache
-rw-r--r-- 1 root wheel 429728 Apr 27 17:17 config-1619536501.xml
-rw-r--r-- 1 root wheel 430599 Apr 29 11:08 config-1619536620.xml
-rw-r--r-- 1 root wheel 430590 Apr 29 11:08 config-1619687301.xml
-rw-r--r-- 1 root wheel 430609 Apr 29 14:52 config-1619687320.xml
-rw-r--r-- 1 root wheel 430647 Apr 30 08:29 config-1619700777.xml
-rw-r--r-- 1 root wheel 430666 Apr 30 08:30 config-1619764170.xml
-rw-r--r-- 1 root wheel 430709 Apr 30 08:39 config-1619764238.xml
-rw-r--r-- 1 root wheel 430644 Apr 30 08:40 config-1619764777.xml
-rw-r--r-- 1 root wheel 430653 Apr 30 08:40 config-1619764818.xml
-rw-r--r-- 1 root wheel 430372 Apr 30 08:41 config-1619764822.xml
-rw-r--r-- 1 root wheel 430279 Apr 30 08:41 config-1619764863.xml
-rw-r--r-- 1 root wheel 430689 Apr 30 08:41 config-1619764867.xml
-rw-r--r-- 1 root wheel 430666 Apr 30 13:41 config-1619764918.xml
-rw-r--r-- 1 root wheel 430675 Apr 30 13:41 config-1619782889.xml
-rw-r--r-- 1 root wheel 430394 Apr 30 13:42 config-1619782894.xml
-rw-r--r-- 1 root wheel 430346 Apr 30 13:43 config-1619782965.xml
-rw-r--r-- 1 root wheel 430281 Apr 30 13:43 config-1619783032.xml
-rw-r--r-- 1 root wheel 430691 Apr 30 18:00 config-1619783036.xml
-rw-r--r-- 1 root wheel 430624 Apr 30 18:00 config-1619798400.xml
-rw-r--r-- 1 root wheel 430633 May 1 00:00 config-1619798431.xml
-rw-r--r-- 1 root wheel 430624 May 1 00:00 config-1619820000.xml
-rw-r--r-- 1 root wheel 430633 May 1 06:00 config-1619820004.xml
-rw-r--r-- 1 root wheel 430624 May 1 06:00 config-1619841600.xml
-rw-r--r-- 1 root wheel 430633 May 1 12:00 config-1619841604.xml
-rw-r--r-- 1 root wheel 430624 May 1 12:00 config-1619863200.xml
-rw-r--r-- 1 root wheel 430633 May 1 12:05 config-1619863231.xml
-rw-r--r-- 1 root wheel 430655 May 1 12:06 config-1619863557.xml
-rw-r--r-- 1 root wheel 430374 May 1 12:07 config-1619863565.xml
-rw-r--r-- 1 root wheel 430337 May 1 12:07 config-1619863639.xml
-rw-r--r-- 1 root wheel 430346 May 1 12:08 config-1619863646.xml
-rw-r--r-- 1 root wheel 430281 May 1 12:08 config-1619863713.xml
-rw-r--r-- 1 root wheel 430691 May 1 12:12 config-1619863720.xml
-rw-r--r-- 1 root wheel 430534 May 1 12:12 config-1619863929.xml
-rw-r--r-- 1 root wheel 430543 May 1 12:12 config-1619863966.xml
-rw-r--r-- 1 root wheel 430262 May 1 12:13 config-1619863970.xml
-rw-r--r-- 1 root wheel 430315 May 1 12:13 config-1619864004.xml
-rw-r--r-- 1 root wheel 430279 May 1 12:13 config-1619864016.xml
-rw-r--r-- 1 root wheel 430689 May 3 16:32 config-1619864022.xml
-rw-r--r-- 1 root wheel 430668 May 4 07:29 config-1620052367.xml
-rw-r--r-- 1 root wheel 430622 May 4 07:30 config-1620106188.xml
-rw-r--r-- 1 root wheel 430609 May 4 07:31 config-1620106241.xml
-rw-r--r-- 1 root wheel 429702 May 4 07:33 config-1620106271.xml
-rw-r--r-- 1 root wheel 429703 May 4 07:34 config-1620106429.xml
-rw-r--r-- 1 root wheel 429660 May 4 07:34 config-1620106440.xml
-rw-r--r-- 1 root wheel 428464 May 4 07:35 config-1620106491.xml
-rw-r--r-- 1 root wheel 428497 May 4 07:36 config-1620106534.xml
-rw-r--r-- 1 root wheel 428498 May 4 07:36 config-1620106574.xml
-rw-r--r-- 1 root wheel 427619 May 4 07:37 config-1620106612.xml
-rw-r--r-- 1 root wheel 426756 May 4 08:41 config-1620106653.xml
-rw-r--r-- 1 root wheel 425903 May 5 07:56 config-1620110479.xml
-rw-r--r-- 1 root wheel 426800 May 5 07:57 config-1620194206.xml
-rw-r--r-- 1 root wheel 426814 May 5 08:07 config-1620194237.xml
-rw-r--r-- 1 root wheel 426814 May 5 08:09 config-1620194839.xml
-rw-r--r-- 1 root wheel 426842 May 5 08:10 config-1620194993.xml
-rw-r--r-- 1 root wheel 426861 May 5 08:10 config-1620195027.xml
-rw-r--r-- 1 root wheel 426886 May 5 08:13 config-1620195040.xml
-rw-r--r-- 1 root wheel 426924 May 5 08:15 config-1620195192.xml
-rw-r--r-- 1 root wheel 426959 May 5 08:16 config-1620195341.xml
-rw-r--r-- 1 root wheel 426991 May 5 08:16 config-1620195365.xml
-rw-r--r-- 1 root wheel 427023 May 5 08:17 config-1620195391.xml
If one of these was zero, I would surely hit that big red alarm button right away.
I'm pretty sure the main config.xml was zeroed out also.
That's close to a Windows PC with a nuked registry file: that system will not boot, period.
edit:
This :
@chamilton_ccn said in Pfsense 2.5 stacks at boot with dots:
/ecl.php: Netgate pfSense Plus is restoring the configuration
is Plus, - and I have no ecl.php file as I use the Community edition.
@chamilton_ccn said in Pfsense 2.5 stacks at boot with dots:
Here's some console output that I believe supports what I said, above:
Not really.
What is the ls -al at that moment?
A "/conf/backup/config-1547535734.xml" is found, read, and promoted to the new config.xml, but it is also found empty (== no objects).
Btw: I installed this many moons ago: https://github.com/KoenZomers/pfSenseBackup (works great). A real set-it-and-forget-it backup tool.
-
@gertjan said in Pfsense 2.5 stacks at boot with dots:
Not really.
Ok, but everything I said was tested, empirically. Or else I'd still have a non-booting device.
What is the ls -al at that moment?
There's no way to know at this point.
-
@chamilton_ccn will keep an eye out. As mentioned I had to restore service ASAP so when fsck and forum searches were unable to resolve the grief I went straight to reformatting and an external config.xml backup restore.
-
@sjm the only other remote thought is a comment from my wife. She said a number of people in town had been experiencing internet service disruption. Not sure if there is any relationship, but it may help someone else to consider a possible link.
-
Ok done some digging; in summary.
- this happened on pfsense 2.5.1.
- pfsense was streaming a YouTube video to my TV at the time it crashed.
- It happened around 07:45 AEST Wednesday the 5th of May.
- NOTE:- the pfsense box was connected to a UPS. There were no audible warnings from the UPS indicating a power fluctuation or drop out in any case.
- pfsense had partly crashed. I.e. web interface had stopped responding. pfsense OS was still working.
- direct connected a keyboard and a monitor. Saw the spooling dots appearing on the screen.
- ctrl-c allowed me to bring up the command prompt.
- ran fsck after booting into single user mode, if I recall correctly at least three times, in an attempt to repair fs corruption. This did not succeed.
- had noted that config.xml had a zero byte file size.
- rebooted a number of times with the same failed result: endless dots appearing on the screen.
- scrubbed the install and did a complete reinstall with a reload of an external backup copy of config.xml.
- everything now appears to be operating normally.
- later on my wife indicated that the local chat had mentioned a number of people in the area where we live experiencing internet service issues (one would assume this would not be in any way connected with the crash I experienced with pfsense?!?).
- see photos and hope they help.
-
@sjm said in Pfsense 2.5 stacks at boot with dots:
It happened around 07:45 AEST Wednesday the 5th of May.
... You saw the dots.
And yet you said:
@sjm said in Pfsense 2.5 stacks at boot with dots:
pfsense had partly crashed.
When you see the dots, pfSense is booting.
Did you reboot (reset ?) it ?
If not, it crashed and rebooted itself.
in an attempt to repair fs corruption. This did not succeed.
If the file system doesn't get cleaned/repaired, it will get mounted in read only mode at best.
This means : not one single byte can get written to the system.
This is not good at all.
Read-only mode means that files can't get created: no empty files, no file names, nothing. The file system can't be altered any more.
I would start doing some serious hardware / drive tests. Or just change the drive.
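The read-only point above can be checked directly by trying to create a file and seeing whether it works. A minimal sketch follows; the function name can_write_in and the probe file name are made up for illustration, and a failed write can have other causes too (permissions, a full disk), so a failure here is a hint rather than proof of a read-only mount.

```sh
#!/bin/sh
# Hypothetical probe: return 0 if a file can be created in the directory.
# On a file system mounted read-only, the create step fails.
can_write_in() {
    probe="$1/.write-probe.$$"
    # The subshell keeps a failed redirection from aborting the calling script.
    if ( : > "$probe" ) 2>/dev/null; then
        rm -f "$probe"
        return 0
    fi
    return 1
}
```

For example, can_write_in /cf/conf && echo writable would print "writable" only when a file could actually be created there (the /cf/conf path is taken from the thread, not verified here).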
Btw : updating Facebook, doing nothing, or watching Youtube isn't related here ^^
A WAN (or LAN) disconnection should not 'break' or reset the system. Or mess up the file system.
These two events are not related.
I mean: I can rip out the WAN connector any time, power down my ISP router any time, rip out the mains plug of my UPS (which protects my ISP router, pfSense, and main switch) any time.
I actually do so every month or so.
Before doing so, I look at ALL the logs for 'special' events.
I start removing power.
And afterwards, I check how the system builds the WAN again, and check if the main LAN works normally.
Btw: about your photos: try capturing the moment the kernel boots and finds the drive dirty. It would then start a file system check (fsck), or you would instruct it to, and it should terminate with a clean file system and continue booting.
-
@gertjan What we really don't know is if the system suddenly crashes and restarts and then hangs with dots, or dots start to appear without rebooting.
That's quite difficult to pinpoint.
I would say the first, but this is an assumption.