2.4.4 & MBT-4220: fresh install followed by config.xml restore yields random UI crashes
I recently purchased the Minnowboard solution from Netgate and moved my setup from my old Dell Optiplex 160 mini desktop to it.
To the point: while everything seems fine at first, after restoring config.xml to the Minnowboard I can trigger complete system crashes immediately after boot by clicking random links in the UI (mainly the top menu, such as OpenVPN, interface assignments, System Log, etc.) or by refreshing the dashboard. By "crash" I mean that all interfaces become unresponsive. Nothing shows up in system.log or /var/crash either. Nothing like this happened on the previous Dell system.
However, if I don't touch the UI, the system just works.
This issue cannot be reproduced when using a fresh install. Nothing odd happens until after I restore the config.
Steps to reproduce:
- install 2.4.4 from the USB key provided by Netgate to the MBT-4220
- let the new install run for a bit, connect to LAN & WAN, run some tests by clicking in the UI. All is OK.
- Restore config.xml from the old system and reboot. While only LAN is connected, nothing bad happens.
- Halt the system.
- Take it to the wiring closet, connect WAN, and boot up. As long as I don't use the UI, everything is great. But as soon as I start exploring, a single click can crash the system, rendering it unresponsive.
One difference I noticed between the working and non-working states (with respect to the UI causing crashes) is that the crashes do not happen if the WAN interface is disconnected. I can restore the config, reboot, and be unable to reproduce the issue. But when I power down the system, take it to the WAN connection, and power it up, I can reproduce the issue.
Any advice here, please? How can I get some debug output and troubleshoot the issue?
- Other Minnowboard owners: do you find the fan to be too loud? By loud I mean that it is in a closet in my house, and I can hear it when I walk by.
- Maybe this would not happen if I restored config.xml during install. Is the only option to mount this file from a UFS partition? When I run mount in the rescue shell, I only see UFS as an option for file system types.
Do you have any packages installed?
Maybe the crash is not caused by your settings but some package which gets installed after the config.xml restore?
Yes, I think it is related to packages as I do have a few installed (OpenVPN client profiles, Squid, and their dependencies).
One thing that comes to mind from when I restored the config: after the first restart, a message said the packages were being reinstalled and to ignore the message if it lasted for a few hours (IIRC). However, the box was not connected to the Internet, since I had not connected WAN. When I look at the UI post-restore, it says no packages are installed, and they're not listed under Services.
And I confirmed again today that the same crashes do not happen if WAN is disconnected.
So I think we can safely say that the packages, or lack thereof, have broken the installation.
What would you do next? Is there an easy way to repair this after the config was restored, or could I restore the config from XML again while the box can reach the package repositories?
You can try wiping all package-related entries out of your backup config.xml, restoring it to your new unit, and then reinstalling package by package, with some testing in between.
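If editing the backup by hand feels error-prone, the wipe could be scripted roughly like this. This is only a sketch under assumptions: it assumes package data sits under an element named `installedpackages`, and the function name `strip_packages` and the sample XML are made up for illustration. ElementTree also drops the XML declaration, comments, and CDATA markers when it re-serializes, so work on a copy and look over the result before restoring it.

```python
import xml.etree.ElementTree as ET


def strip_packages(xml_text: str, tag: str = "installedpackages") -> str:
    """Return xml_text with every element named `tag` removed.

    The default tag name is an assumption about the backup layout;
    check it against your own config.xml first.
    """
    root = ET.fromstring(xml_text)
    # Snapshot the element list so removals do not disturb the traversal.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == tag:
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


# Tiny made-up sample, not a real backup.
sample = (
    "<pfsense>"
    "<system><hostname>fw</hostname></system>"
    "<installedpackages><package><name>squid</name></package></installedpackages>"
    "</pfsense>"
)
cleaned = strip_packages(sample)
print(cleaned)
```

For a real file, read the backup copy into a string, pass it through the function, and write the result to a new file so the original backup stays untouched.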
I have ruled out packages being the root cause by restoring a config without any package information. I'm still getting crashes in the UI regardless of where I click. The box is mostly stable as long as I don't log in.
So the problem might be coming from the contents of the config.xml file.
I'm starting to think of cutting my losses by starting from scratch and rebuilding my config manually.
I thought config restores would be more stable than this. Are there things one should look out for when doing this? Is it common to have issues with config restores?
Is there a way to set up debug logging persistently so that I can find the root cause when the UI crashes?
After spending time on this that I'd rather have back, I found that the root cause could have been any one or all of the widgets on the dashboard. After removing all the widgets on the source host and then redeploying the "All" configuration on the new host, the issue went away. The crash seemed to be triggered by clicking any link on the UI while the dashboard was present.
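For anyone who no longer has access to the source host, the same effect might be achievable by stripping the dashboard widget settings from a copy of the backup before restoring it. This is an untested sketch: it assumes the widget layout is stored in elements named `widgets`, and the function name `remove_widgets` and the sample XML are hypothetical. I did it through the UI on the old box, so treat this as a starting point only.

```python
import xml.etree.ElementTree as ET


def remove_widgets(xml_text: str, tag: str = "widgets"):
    """Remove every element named `tag`; return (new_xml, removed_count).

    The tag name is an assumption; confirm it in your own backup.
    """
    root = ET.fromstring(xml_text)
    removed = 0
    # Snapshot the element list so removals do not disturb the traversal.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == tag:
                parent.remove(child)
                removed += 1
    return ET.tostring(root, encoding="unicode"), removed


# Tiny made-up sample, not a real backup.
sample = (
    "<pfsense>"
    "<system><hostname>fw</hostname></system>"
    "<widgets><sequence>system_information:col1:open</sequence></widgets>"
    "</pfsense>"
)
stripped, count = remove_widgets(sample)
print(count, stripped)
```

A removed count of zero would suggest the widget settings live under a different element name in your file, in which case nothing has been changed.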