Pfsense 2.5 stacks at boot with dots
-
I'm experiencing a very serious system crash.
System has been up and running under kvm for weeks with no issues:
pfsense 2.5 under centos8 kvm.
After implementing the fq-codel workaround for icmp and doing a few successful traces, I lost access to pfsense.
No ping, nada.
Looking at the console I would see a screen full of dots.
Pressing ctrl-c I got
Issuing a reboot reports a clean filesystem, the usual boot procedure,
and this is where it starts.
What is this?
I tried shutting down the vm and restarting.. Doesn't make any difference.
Any pointers, more than welcome.
-
Yeah, I too have seen these on real Intel hardware. Not sure what causes it, but control+c, exit, and another reboot works. Wow, 2.5.
-
@netblues Would love to know the issue. I'm facing the same problem.
-
@vajonam I have the same exact problem and can't seem to get past this. No amount of CTRL+C & exit nor booting into single user mode and running
fsck -fy /
has worked.
EDIT: FWIW I'm getting this on a brand new XG-7100 1U
-
Ok! Here's what seems to have resolved it for me:
#1 When you see the dots, hit CTRL+C
#2 cd /cf/conf/backup
#3 Look for a zero-byte config file and delete it. (Here's a paste from my terminal; not sure why the dates are wrong but at least most of these should be from this month):
# ls -l
total 2104
-rw-r--r-- 1 root wheel 0 Jan 15 22:19 config-1547535734.xml
-rw-r--r-- 1 root wheel 31483 Jan 15 02:02 config-1547535747.xml
-rw-r--r-- 1 root wheel 31475 Jan 15 02:02 config-1547535753.xml
-rw-r--r-- 1 root wheel 32305 Jan 15 02:02 config-1547535762.xml
-rw-r--r-- 1 root wheel 32297 Jan 15 02:03 config-1547535769.xml
-rw-r--r-- 1 root wheel 33129 Jan 15 02:03 config-1547535791.xml
-rw-r--r-- 1 root wheel 33953 Jan 15 02:03 config-1547535803.xml
-rw-r--r-- 1 root wheel 33120 Jan 15 02:04 config-1547535808.xml
-rw-r--r-- 1 root wheel 33121 Jan 15 02:04 config-1547535882.xml
-rw-r--r-- 1 root wheel 33130 Jan 15 02:05 config-1547535888.xml
-rw-r--r-- 1 root wheel 33982 Jan 15 02:05 config-1547535924.xml
-rw-r--r-- 1 root wheel 33974 Jan 15 02:06 config-1547535930.xml
-rw-r--r-- 1 root wheel 33983 Jan 15 02:07 config-1547535978.xml
-rw-r--r-- 1 root wheel 33953 Jan 15 02:07 config-1547536024.xml
-rw-r--r-- 1 root wheel 33928 Jan 15 02:07 config-1547536025.xml
-rw-r--r-- 1 root wheel 34758 Jan 15 02:08 config-1547536061.xml
-rw-r--r-- 1 root wheel 35587 Jan 15 02:08 config-1547536091.xml
-rw-r--r-- 1 root wheel 36408 Jan 15 02:09 config-1547536133.xml
-rw-r--r-- 1 root wheel 36428 Jan 15 02:09 config-1547536150.xml
-rw-r--r-- 1 root wheel 36453 Jan 15 02:12 config-1547536151.xml
-rw-r--r-- 1 root wheel 36526 Jan 15 02:12 config-1547536344.xml
-rw-r--r-- 1 root wheel 36547 Jan 15 02:14 config-1547536366.xml
-rw-r--r-- 1 root wheel 36686 Jan 15 02:20 config-1547536456.xml
-rw-r--r-- 1 root wheel 36742 Jan 15 02:21 config-1547536856.xml
-rw-r--r-- 1 root wheel 36971 Jan 15 02:22 config-1547536899.xml
-rw-r--r-- 1 root wheel 37182 Jan 15 02:23 config-1547536929.xml
-rw-r--r-- 1 root wheel 37409 Jan 15 02:23 config-1547537013.xml
-rw-r--r-- 1 root wheel 37636 Jan 15 02:30 config-1547537039.xml
-rw-r--r-- 1 root wheel 37703 Jan 15 02:53 config-1547537458.xml
-rw-r--r-- 1 root wheel 37738 Jan 15 18:24 config-1547538793.xml
# rm config-1547535734.xml
#4 (Likely unnecessary) Just to be on the safe side, reboot, select option #2 (single-user mode), and then run fsck -fy / a few (~5ish) times. Then reboot again.
EDIT: Formatting
EDIT2: Duh, dates are wrong because system datetime was wrong (oops... I thought I set that.)
EDIT3: Formatting. Again.
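
For anyone who would rather not delete files by hand, steps #2 and #3 above can be sketched as a pair of small shell functions. This is only an illustration, not part of pfSense: the function names and the quarantine directory are hypothetical, and the functions are written for POSIX sh, so they would need to be run from an sh session or saved as a script. Moving the empty files aside instead of removing them keeps the evidence around for a later bug report.

```shell
#!/bin/sh
# Sketch only: function names and the quarantine location are hypothetical.

# Print every zero-byte config-*.xml found directly inside directory $1.
list_empty_backups() {
    for f in "$1"/config-*.xml; do
        # -f skips the unexpanded pattern when nothing matches; -s is true for non-empty files.
        if [ -f "$f" ] && [ ! -s "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}

# Move the zero-byte backups from directory $1 into directory $2 instead of deleting them.
quarantine_empty_backups() {
    mkdir -p "$2" || return 1
    list_empty_backups "$1" | while IFS= read -r f; do
        mv "$f" "$2"/
    done
}
```

On an affected box this might be called as `quarantine_empty_backups /cf/conf/backup /root/empty-configs`, where the second path is an arbitrary choice.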
-
@chamilton_ccn really appreciate your reply. It was a puzzling issue. Is there a known issue with pfs 2.5.1 as I have never experienced the likes of this under 2.4... ?
I had to restore the service ASAP so I wiped the drive, reinstalled pfs 2.5 and restored the latest config.xml backup.
Everything is up and running smoothly again but wondering if the hiccup that caused the corruption is still lurking in the background in pfs 2.5.1?
Regards
SjM
-
I would have put it on the list:
If there was a power loss, the file system can get hosed.
This big list can have as a consequence:
Files are created, but have no content.
Two solutions: use one or the other, or both.
Not choosing at least one of these solutions creates a lot of work.
So:
Get a UPS
and/or
Watch a video.
-
@gertjan The thing is that power was not lost, the vm did not crash.
Other VMs kept running happily.
I had to reinstall in the end. No amount of fsck could solve it.
So in the end, can deleting the zero-byte config file fix the issue?
-
@netblues said in Pfsense 2.5 stacks at boot with dots:
So in the end, can deleting the zero-byte config file fix the issue?
A zero bytes config file like config-xxxxxxxxxx.xml can't exist.
These files should be copies of the current config.xml file.
Disk resources ok?
No PHP RAM issues?
A config-xxxxxxxxxx.xml would only be used if the main /cf/conf/config.xml file is invalid. There will be error log messages in the system log telling you this. This is already a very bad situation.
Btw : be careful : Centos8 is a dead end.
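
The fallback rule described above (a backup is only consulted when /cf/conf/config.xml is invalid) suggests a quick sanity check from the shell. The sketch below is a crude stand-in, not pfSense's own validation: it only tests that the file is non-empty and contains an opening pfsense tag, on the assumption that this is the root element the "no pfsense object found" error refers to. The function name is hypothetical.

```shell
#!/bin/sh
# Sketch only: a rough check, not a real XML parse and not pfSense's own logic.

# Succeed (exit 0) if file $1 exists, is non-empty, and contains a <pfsense> opening tag.
config_looks_valid() {
    [ -s "$1" ] && grep -q '<pfsense[ >]' "$1"
}
```

A call such as `config_looks_valid /cf/conf/config.xml || echo "config.xml looks broken"` would flag an empty or truncated file, though a file can pass this test and still be malformed.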
-
@gertjan Well, if it was a php ram issue, a reboot would fix it.
I did restore to a previous snapshot and it also worked.
Now I'm keeping solid snapshots, so next time I'll have much more control
over the problematic instance.
As for centos8, I'm fully aware of it.
Planning to move to the free Red Hat Enterprise soon.
This is my home office/lab setup. It can be migrated fully in a weekend.
-
@netblues same here. There was no power loss. One minute the system was running fine and the next minute, the first sign of trouble was the spinning wheel on the TV YouTube stream.
Ran fsck at least three times with no satisfaction.
Under 2.4 I'd had numerous power outages and brown outs and the system kept coming back every time without a hitch.
-
@gertjan said in Pfsense 2.5 stacks at boot with dots:
If there was a power loss, the file system can get hosed.
In my experience, this is only conditionally true. There are certain times where it's ok to kill the power to the device and certain times where it tends to cause trouble. Instances where I've observed this directly include:
- Instructing an end-user on site to power cycle the device using the rocker switch. In this particular case, I was on the phone with the user while they were trying to locate the switch. During the call, they flipped the switch once but weren't sure whether they had done the right thing until after repeating to me what they did. When I said "yes, that's the correct switch" they flipped the switch again, interrupting the boot process. The device needed to be recovered in single-user mode by running fsck -fy / about 5 times or so.
- Powering the device off via the rocker switch within about a minute of changing the LAN IP address via console. It seems there was still something happening in the background for a little while after this, and powering it down almost immediately after logging out (back to the console menu) appears to have zeroed out /cf/conf/config.xml and/or created a zero-byte config file in /cf/conf/backup. This was the cause of the problem that led me to this discussion thread.
But! If the device has been running and no recent config changes have been made, it seems to be totally, 100% OK to kill the power unexpectedly. In my organization, we always deploy a UPS with each of our devices, but depending on the duration of a power outage, the UPS will run out of juice and everything will go down. I have yet to see a device not come back up after a power event of that nature.
DISCLAIMER: We have only ever deployed the XG-7100 1U (except for one or two repurposed Dell servers we currently have or had in production) so everything I've said above only applies to those machines.
-
@sjm said in Pfsense 2.5 stacks at boot with dots:
@netblues same here. There was no power loss. One minute the system was running fine and the next minute, the first sign of trouble was the spinning wheel on the TV YouTube stream.
Ran fsck at least three times with no satisfaction.
Under 2.4 I'd had numerous power outages and brown outs and the system kept coming back every time without a hitch.
Same here...
It IS very dangerous to be left in the cold like that in a production environment.
And since it isn't just me, and is not only under kvm, it has to BE something inside pfsense code and/or BSD.
For the time being we need to find a way to reproduce the issue so a proper bug report can be filed.
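
One low-effort way to narrow down when the file gets emptied would be to record the size of config.xml at regular intervals and look for the first zero afterwards. The function below is a minimal sketch of that idea; the function name and the log path are hypothetical, and how it gets scheduled (cron, a loop in a screen session, and so on) is left open.

```shell
#!/bin/sh
# Sketch only: the function name and log location are hypothetical.

# Append "<epoch-seconds> <size-in-bytes>" for file $1 to log file $2.
# A missing file is recorded with a size of -1.
record_config_size() {
    if [ -f "$1" ]; then
        # tr strips the padding some wc implementations print before the number.
        size=$(wc -c < "$1" | tr -d ' ')
    else
        size=-1
    fi
    printf '%s %s\n' "$(date +%s)" "$size" >> "$2"
}
```

For example, `record_config_size /cf/conf/config.xml /root/config-size.log` run every minute would leave a timeline that could be attached to a bug report.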
-
@gertjan said in Pfsense 2.5 stacks at boot with dots:
A zero bytes config file like config-xxxxxxxxxx.xml can't exist.
This is 100%, demonstrably untrue. This was the root of my problem yesterday; however, looking back on yesterday's terminal transcript, I realize why the device was trying to load the most recent backup:
It seems the /cf/conf/config.xml file was also empty. The errors below complain that the file doesn't exist, but shortly after seeing those and dropping to a shell, I confirmed that the file definitely existed and, recalling from memory (since I neglected to capture the ls -l output showing this), the file was also zero bytes. I didn't pay much attention to this at the time and instead focused on the last bit of output where a backup config was being loaded. I didn't delete config.xml and instead opted to delete the only empty backup config, under the assumption that the next newest one would be loaded in its absence (which turned out to be exactly the case).
So it would appear that an empty config.xml won't cause this problem by itself; the system will boot just fine if you have a good backup config in /cf/conf/backup.
Here's some console output that I believe supports what I said above:
2019-01-15T22:13:03.669685-05:00 php-fpm 905 - - /ecl.php: No config.xml found, attempting last known config restore.
2019-01-15T22:13:03.670131-05:00 php-fpm 905 - - /ecl.php: New alert found: No config.xml found, attempting last known config restore.
2019-01-15T22:13:03.672266-05:00 php-fpm 905 - - /ecl.php: XML error: no pfsense object found!
2019-01-15T22:13:03.672335-05:00 php-fpm 905 - -
2019-01-15T22:13:03.672600-05:00 php-fpm 905 - - /ecl.php: Netgate pfSense Plus is restoring the configuration /conf/backup/config-1547535734.xml
2019-01-15T22:13:03.672810-05:00 php-fpm 905 - - /ecl.php: New alert found: Netgate pfSense Plus is restoring the configuration /conf/backup/config-1547535734.xml
2019-01-15T22:13:03.674754-05:00 php-fpm 905 - - /ecl.php: XML error: no pfsense or pfsense object found!
2019-01-15T22:13:03.674927-05:00 php-fpm 905 - -
2019-01-15T22:13:03.675529-05:00 php-fpm 905 - - /ecl.php: XML error: no pfsense object found!
2019-01-15T22:13:03.675593-05:00 php-fpm 905 - -
2019-01-15T22:13:03.675805-05:00 php-fpm 905 - - /ecl.php: Netgate pfSense Plus is restoring the configuration /cf/conf/backup/config-1547535734.xml
2019-01-15T22:13:03.676054-05:00 php-fpm 905 - - /ecl.php: New alert found: Netgate pfSense Plus is restoring the configuration /cf/conf/backup/config-1547535734.xml
Compare the filename of backup config file above, to the one I deleted in my post ... it's the same file.
Bottom line: Something happened to cause both of these to be zero-bytes and it brought the boot process to a dead stop.
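
If the workaround is to make sure the system falls back to a usable backup, it helps to know which backup that would be. The sketch below picks the non-empty backup with the highest epoch number in its filename. That ordering is an assumption: the thread does not confirm how pfSense itself chooses a backup, and the directory listing earlier shows that modification times can disagree with the filename order. The function name is hypothetical, and it assumes all epoch numbers have the same number of digits so that the shell's sorted glob matches numeric order.

```shell
#!/bin/sh
# Sketch only: ordering by filename is an assumption, not confirmed pfSense behaviour.

# Print the non-empty config-*.xml in directory $1 that sorts last by name.
# Exit status is non-zero when no non-empty backup exists.
newest_nonempty_backup() {
    best=
    for f in "$1"/config-*.xml; do
        # The glob expands in sorted order, so the last non-empty match wins.
        if [ -s "$f" ]; then
            best=$f
        fi
    done
    [ -n "$best" ] && printf '%s\n' "$best"
}
```

Running `newest_nonempty_backup /cf/conf/backup` before rebooting would show whether there is anything sane left to fall back on.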
Also interesting was the message that gets logged directly before it starts printing dots to the console:
2019-01-15T22:19:38.264066-05:00 php-fpm 1301 - - /ecl.php: New alert found: Netgate pfSense Plus is restoring the configuration /cf/conf/backup/config-1547535734.xml
External config loader 1.0 is now starting... mmcsd0s1 mmcsd0s1a mmcsd0s1b
Launching the init system...Updating CPU Microcode...
CPU: Intel(R) Atom(TM) CPU C3558 @ 2.20GHz (2200.07-MHz K8-class CPU)
  Origin="GenuineIntel" Id=0x506f1 Family=0x6 Model=0x5f Stepping=1
  Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0x4ff8ebbf<SSE3,PCLMULQDQ,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,SDBG,CX16,xTPR,PDCM,SSE4.1,SSE4.2,x2APIC,MOVBE,POPCNT,TSCDLT,AESNI,XSAVE,OSXSAVE,RDRAND>
  AMD Features=0x2c100800<SYSCALL,NX,Page1GB,RDTSCP,LM>
  AMD Features2=0x101<LAHF,Prefetch>
  Structured Extended Features=0x2294e283<FSGSBASE,TSCADJ,SMEP,ERMS,NFPUSG,MPX,PQE,RDSEED,SMAP,CLFLUSHOPT,PROCTRACE,SHA>
  Structured Extended Features3=0xac000400<MD_CLEAR,IBPB,STIBP,ARCH_CAP,SSBD>
  XSAVE Features=0xf<XSAVEOPT,XSAVEC,XINUSE,XSAVES>
  IA32_ARCH_CAPS=0x69<RDCL_NO,SKIP_L1DFL_VME>
  VT-x: PAT,HLT,MTF,PAUSE,EPT,UG,VPID,VID,PostIntr
  TSC: P-state invariant, performance statistics
Done.
.... done.
Initializing.................. done.
Starting device manager (devd)...devd: Can't open devctl device /dev/devctl: Device busy
done.
Loading configuration......done.
Updating configuration.......................................................................................................^C.
2019-01-15T22:19:41.788360-05:00 init 1 - - /bin/sh on /etc/rc terminated abnormally, going to single user mode
Updating configuration ... followed by a gazillion dots (until you see the process interrupted by ^C). I don't really know what's being updated here and how empty config files play into this process, but it's curious indeed.
-
@sjm FWIW this occurred on a device running 21.02.2-RELEASE (amd64). It might only be something that happens with 2.5 and later. With as many of these machines as I have in production, and given how easily this issue occurred (for me), I'd say it's a bug in the newer versions of pfSense.
-
@chamilton_ccn Do note that this isn't happening while booting, or changing the config.
Which makes it even more difficult to understand how files become 0 length.
Unless of course there is something automated running in the background.
-
@netblues said in Pfsense 2.5 stacks at boot with dots:
@chamilton_ccn Do note that this isn't happening while booting, or changing the config.
Which makes it even more difficult to understand how files become 0 length.
Unless of course there is something automated running in the background.
Yeah, that is strange for it to happen while the system is running. It's also something I haven't seen yet, which makes me a bit nervous. Just curious, did you actually find zero length config files in your situation? It's possible this issue has multiple causes; it just so happened that my situation was due to empty configs. Yours might be due to something else.
-
@sjm said in Pfsense 2.5 stacks at boot with dots:
Under 2.4 I'd had numerous power outages and brown outs and the system kept coming back every time without a hitch.
Same. The only time I've had a device not come back up is when there was a clear explanation as to why.
-
@chamilton_ccn I do remember seeing a zero length config.xml file.
-
@sjm said in Pfsense 2.5 stacks at boot with dots:
@chamilton_ccn I do remember seeing a zero length config.xml file.
Interesting! Do you know if the /cf/conf/backup directory was empty and/or whether the most recent backup config was empty? If you're uncertain and this happens again, definitely check that and report back!
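
To make that check quick to capture next time, something like the following could be run from the shell before touching anything. It is only a sketch: the function name and the output wording are made up for illustration, and it assumes the layout discussed in this thread (config.xml in the conf directory, backups in a backup subdirectory named config-*.xml).

```shell
#!/bin/sh
# Sketch only: function name and output format are hypothetical.

# Summarise the state of config.xml and its backups under conf directory $1.
report_config_state() {
    conf=$1/config.xml
    if [ -f "$conf" ]; then
        echo "config.xml bytes: $(wc -c < "$conf" | tr -d ' ')"
    else
        echo "config.xml bytes: missing"
    fi
    total=0
    empty=0
    for f in "$1"/backup/config-*.xml; do
        # Skip the unexpanded pattern when the directory has no backups.
        [ -f "$f" ] || continue
        total=$((total + 1))
        if [ ! -s "$f" ]; then
            empty=$((empty + 1))
        fi
    done
    echo "backups: $total"
    echo "empty backups: $empty"
}
```

Calling `report_config_state /cf/conf` and pasting the three lines it prints would answer both questions above in one go.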