Netgate 8200 Max pfsense+ v22.05.1 local disk boot after install.
-
Hello.
As background, I've done a lot of pfSense+ installs on the Netgate 7100 devices with 22.05 without any issue.
I've now done a clean install of pfSense+ 22.05.1 on a Netgate 8200 Max, and it's the first time I've done it on an 8200.
After install, at least on the 7100 devices, I can always just load the boot menu and choose the internal disk, and it bypasses the USB as expected. This avoids having to get someone physically near the device to pull the USB for the initial boot. This doesn't seem to be working on the 8200.
I do realize the installer instructions say to physically remove the USB before rebooting; however, they say the same for the 7100, and choosing the disk from the boot menu works fine there. I'm trying to avoid physically removing the USB for each of these installs (I have several to do) because the devices are located around the globe. It can take days for an international support request to the local datacenter to get the drive pulled, then I "hope" it works, and they have to reinsert the drive when I'm done anyway (for ECL usage). The back & forth with the facility crews can take days depending on their location and facility SLA, and it costs my company money every time I open a remote hands request. I do not have the luxury of being physically present with these devices, as they are scattered around the globe in various datacenters.
Since using the boot order override through the boot menu has always worked fine on the 7100 devices I figured it would work here too.
Screenshot hopefully gives a good indicator of what I'm seeing. Selecting NVMe here just drops me directly back to this same menu again.
Here is the menu option after selecting < to show the path:
I can boot to USB with no issue and reinstall as needed. I've tried reinstalling several times and with different options.
I've tried using 100% default install options as well as trying ZFS with MBR just to see if it makes a difference. I didn't notice any errors but I'm on a serial connection so it's a little difficult to catch all messages.
Is it confirmed that the 8200 device cannot (regardless of the boot option) boot to local disk if the installer USB is physically installed? (this is a major pain but I'll bite the bullet if I truly have to) (This has always worked fine on the 7100 devices)
My concern is that there may be some underlying issue preventing booting from the local disk, rather than simply the fact that the USB drive is still inserted. Any diagnosis hints or suggestions before I start engaging on-site teams to pull the drives temporarily? I fear that there is a bootloader issue with the local disk install, that what I'm seeing is just obscuring it, and that pulling the installer drive will only delay troubleshooting for days.
I'm just looking for some insight or ideas on what I could try or diagnose remotely over a serial console without giving on-site teams the runaround. I can hop into an installer shell chrooted to the installed OS to help diagnose things if anyone has an idea of what I can try.
-
If you select the NVMe drive and it just goes back to the menu that implies it's an invalid UEFI variable. Which means it's probably the old entry from before the install.
I expect a new entry to be added at first boot. You should then be able to move the USB device entry below that valid NVMe entry so it boots NVMe by default.
If you have a config on the USB to pull in via the ECL that will get pulled in at every boot if you don't remove it.
-
@stephenw10 said in Netgate 8200 Max pfsense+ v22.05.1 local disk boot after install.:
If you select the NVMe drive and it just goes back to the menu that implies it's an invalid UEFI variable. Which means it's probably the old entry from before the install.
I expect a new entry to be added at first boot. [...]
Thanks so much for your reply!
Do you have a suggestion to repair this? I can clean install the OS again if that's helpful to write this UEFI entry.
EDIT: OK, so I tried removing that entry from the boot menu (it dropped off successfully). However, now the internal device doesn't show in the boot menu at all. I've tried rebooting, and I tried another clean install using 100% default install options. I'm using pfSense-plus-memstick-serial-22.05.1-RELEASE-amd64.img.gz. All I can see in the boot device list is the USB and PXE.
You should then be able to move the USB device entry below that valid NVMe entry so it boots NVMe by default.
The internal drive is already at the "top" of the list. I have tried selecting it manually (this returns immediately back to the boot menu again endlessly) and also tried letting the device "naturally" boot without selecting boot options (this will load USB installer).
If you have a config on the USB to pull in via the ECL that will get pulled in at every boot if you don't remove it.
I do plan to use this drive for ECL after everything is up and running but for now the USB drive is just used for installing pfsense. There is no config.xml on the drive currently for ECL, it's just being used for the pfsense install image.
-
@emmdee said in Netgate 8200 Max pfsense+ v22.05.1 local disk boot after install.:
The internal drive is already at the "top" of the list. I have tried selecting it manually (this returns immediately back to the boot menu again endlessly) and also tried letting the device "naturally" boot without selecting boot options (this will load USB installer).
I meant the new entry that I expect to be added after an install completes.
Is there any reason you're using 22.05.1?
What BIOS version is shown?
Have you tried resetting the boot menu options? That should then generate a new entry for the NVMe that is valid at the next boot.
-
@stephenw10 said in Netgate 8200 Max pfsense+ v22.05.1 local disk boot after install.:
I meant the new entry that I expect to be added after an install completes.
[...]
Have you tried resetting the boot menu options? That should then generate a new entry for the NVMe that is valid at the next boot.
It re-scanned devices and now all I see is USB and PXE.
I guess this explains why I was unable to boot to the previously listed internal disk entry.
Is there any reason you're using 22.05.1?
We have a fleet of over 40 Netgate devices (7100s running v22.05 until they were deprecated, and now 8200s, which need v22.05.1 according to support).
They are managed using CI/CD processes in an automated fashion that's engineered to work with v22 configs. It's only vetted and compatible with v22 configs right now, since it installs packages, and v23.01 completely hosed multiple new boxes when we tried it a while back in non-prod environments (FRR package issues, repo issues, and a lot of other details I can't 100% recall).
Because of this we have not had the manpower to go through the upgrade and validate 23.09 to see whether the 23.01 issues were fixed (since v22 works excellently in production). Upgrading the OS on a production fleet of 40+ firewall devices is not a trivial task, and since we had issues with v23 it hasn't been on the schedule to revisit. I'm really just trying to stand up a new site and running into these issues. Even this device, which shipped with 23.01, was throwing libc errors on first start.
What BIOS version is shown?
Sorry but I'm not quite sure where to see that or what hotkey is needed to get into the BIOS to check.
The first thing that pops on screen during a reboot is this, which doesn't appear to be a BIOS POST. I'm on a serial console.
isSecurebootEnabled = 01
secureboot not enabled
Loading Usb Lens.
Provider: Insvde Software
Version: Product: 05.12.02.0040.0009.16 BlinkBoot (R) Harrisonville USB Lens
Licensee: Silicom Ltd.
Start: 03/20/2018
End: 12/31/2099
Validating Signature... [Pass]
Installing Usb Lens..
Some additional notes....
Hardware problem?
Since the installer sees the disk fine, I doubt there is a hardware issue with the disk, but I don't want to rule it out. I can also "browse" the filesystem chrooted into the installed system after the installer finishes without issue. I hypothesize that perhaps the previous partition structure (and maybe the EFI config) isn't getting "wiped" upon a clean install?
Trying different disk layouts:
I would like to fully wipe the disk clean and reinstall, as perhaps something is left over from the 23.01 install that the device shipped with (which was quite broken, with tons of errors such as "/usr/local/sbin/pkg: Undefined symbol __libc_start1@FBSD_1.7" and others).
I also thought about wiping the existing partition structure using the UFS guided setup (I've been doing the ZFS default "auto" config during install so far). However, when I get to the partition editor menu it lists my USB drive da0, so I was a bit scared of wiping the USB disk in the process....
For example, trying UFS guided disk setup (I have been choosing "Auto (ZFS)" previously):
Here is where I'm worried about it wiping my da0 disk. Wiping the installer media would literally take days to recover from, with on-site staff on the opposite side of the globe.
Is it safe to proceed here, and it won't re-partition my da0 device? My goal is to wipe clean and reinstall on nvd0.
Device ID for support:
There is PRO TAC support on this box, but since it's not booting I'm not sure how to get the device ID. Is there a way to get that ID on a non-booting system? I can chroot into the installed OS filesystem after install completes if that helps to gather the ID.
-
The BIOS version is shown in the dashboard or in the output of 'kenv' at the console. It should be Version: CORDOBA-03.00.00.03t. But you can only see that once it's booted.
Yes, 22.05.1 was the first version that supported the 8200.
Yes, you could safely install UFS from that screen, but it won't help here. You should use ZFS anyway.
You have to boot it one time without the USB drive attached to generate the NVMe UEFI variable. I didn't realise you are currently remote from that device?
The only other thing you might do there is to try manually adding an entry from the shell after the install has finished (or before it starts). For example:
When finished, type 'exit' to return to the installer.
# efibootmgr -v
Boot to FW : false
BootCurrent: 0003
Timeout : 0 seconds
BootOrder : 0002, 0003, 0001, 0000
Boot0002* bootx64.efi PciRoot(0x0)/Pci(0xb,0x0)/Pci(0x0,0x0)/NVMe(0x1,ab-bb-06-ae-18-3e-69-24)/HD(1,GPT,6f908b25-ca7f-11ee-b15f-90ec77475ce8,0x28,0x82000)/File(\efi\boot\BOOTx64.efi)
    nda0p1:/efi/boot/BOOTx64.efi (null)
+Boot0003* bootx64.efi PciRoot(0x0)/Pci(0x15,0x0)/USB(0x2,0x0)/HD(1,MBR,0x90909090,0x1,0x10418)/File(\EFI\BOOT\bootx64.efi)
    da0s1:/EFI/BOOT/bootx64.efi (null)
Boot0001* bootx64.efi PciRoot(0x0)/Pci(0x15,0x0)/USB(0x1,0x0)/HD(1,MBR,0x90909090,0x1,0x10418)/File(\EFI\BOOT\bootx64.efi)
    da0s1:/EFI/BOOT/bootx64.efi (null)
Boot0000* PXE-0 Fv(e35c4a77-3d00-4337-a625-b980a9e00f6c)/File(\pxe_0.nsh)
Unreferenced Variables:
This might take a bit of experimentation to get a working command, but if you're able to see that output it should be possible.
Steve
-
@stephenw10 said in Netgate 8200 Max pfsense+ v22.05.1 local disk boot after install.:
You have to boot it one time without the USB drive attached to generate NVMe UEFI variable. I didn't realise you are currently remote from that device?
Ok thank you. It sounds like I just need to bite the bullet here and work with the datacenter staff. Using manual boot device selection "just worked" on v22.05 on the 7100 devices so I was hoping it would work on the 8200 with 22.05.1 as well.
Yes I'm many thousands of miles from this device, worst possible time zone difference actually :/
My fear was that it won't boot even after pulling it out then I need to reach out again to plug it back in for further troubleshooting (more back & forth).
The only other thing you might do there is to try manually adding an entry from the shell
Excellent suggestion. After installing, I tried mounting the EFI partition in the installed OS shell at /efi (obviously I can't mount it at /) and running some efibootmgr commands such as:
efibootmgr -a -c -l /efi/efi/boot/BOOTx64.efi -L pfSense
This did add entries, but they are messed up and don't boot. The pathing looks wrong, I guess because of how the partition is currently mounted.
Boot0000* pfSense HD(1,GPT,7070bbb9-d204-11ee-bce7-90ec7774d2e1,0x28,0x64000)/File(\boot\BOOTx64.efi)
    nvd0p1:/boot/BOOTx64.efi /mnt/efi//boot/BOOTx64.efi
It looks like the BSD version of efibootmgr doesn't take a -d disk option like the Linux version does and just maps the disk from the mount point. I can't quite figure this one out, but it would be IDEAL if I could get it to work. Great suggestion; I'm just not quite there with it.
Thanks again for the assistance. If you have any ideas about the manual boot manager entries I'm all ears, as it would save me literally days' worth of back & forth with remote teams, since I will have many more Netgate 8200 Max devices to initialize in 2024.
-
Ok I was able to get this to work. The key thing here seems to be that after the install the existing mounts cause an issue.
So I reset the boot list from the BIOS boot device menu. It then reboots and boots the installer from USB.
I installed using the default values in 22.05.1, which install as ZFS to the NVMe drive.
Then it reboots back into the installer. This time select the rescue shell instead of installing:
At that point there is no EFI entry for the NVMe drive:
# efibootmgr
BootCurrent: 0001
Timeout : 0 seconds
BootOrder : 0001, 0000
+Boot0001* bootx64.efi
Boot0000* PXE-0
So mount the EFI partition from the NVMe drive, create an entry and mark it active:
# mount_msdosfs /dev/nvd0p1 /mnt
# efibootmgr -c -l '/mnt/efi/boot/BOOTx64.efi' -L NVMe
BootCurrent: 0001
Timeout : 0 seconds
BootOrder : 0002, 0001, 0000
Boot0002 NVMe
+Boot0001* bootx64.efi
Boot0000* PXE-0
# efibootmgr -a 0002
BootCurrent: 0001
Timeout : 0 seconds
BootOrder : 0002, 0001, 0000
Boot0002* NVMe
+Boot0001* bootx64.efi
Boot0000* PXE-0
# umount /mnt
At that point the EFI entries should be correct:
# efibootmgr -v
BootCurrent: 0001
Timeout : 0 seconds
BootOrder : 0002, 0001, 0000
Boot0002* NVMe HD(1,GPT,d02486aa-d686-11ee-a7f6-90ec77475ce8,0x28,0x64000)/File(\efi\boot\BOOTx64.efi)
    nvd0p1:/efi/boot/BOOTx64.efi (null)
+Boot0001* bootx64.efi PciRoot(0x0)/Pci(0x15,0x0)/USB(0x1,0x0)/HD(1,MBR,0x90909090,0x1,0x20000)/File(\efi\boot\BOOTx64.efi)
    da0s1:/efi/boot/BOOTx64.efi (null)
Boot0000* PXE-0 Fv(e35c4a77-3d00-4337-a625-b980a9e00f6c)/File(\pxe_0.nsh)
Unreferenced Variables:
So reboot directly from there with 'reboot' and it will boot into the install on the NVMe drive.
Steve
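For anyone repeating this on several boxes, the rescue-shell steps above could be collected into a few shell functions. This is only a sketch under assumptions: the device (/dev/nvd0p1), mount point (/mnt), loader path and label (NVMe) are taken from the commands in this post, while the function names (find_bootnum, first_in_bootorder, add_nvme_entry) are made up for illustration. It also assumes a simple label with no spaces, slashes or regex characters, and that the plain efibootmgr listing looks like the output shown above. It has not been run on an 8200.

```shell
#!/bin/sh
# Sketch only: helper functions for the installer's rescue shell.
# Device, mount point, loader path and label mirror the post above;
# the function names are hypothetical.

# Print the boot number of the first entry whose label is exactly $1.
# Reads plain (non -v) `efibootmgr` output on stdin.
find_bootnum() {
    sed -n "s/^[+ ]*Boot\([0-9A-Fa-f]\{4\}\)[* ]*$1[[:space:]]*\$/\1/p" | head -n 1
}

# Print the first boot number on the BootOrder line (same input as above).
first_in_bootorder() {
    sed -n 's/^BootOrder[[:space:]]*:[[:space:]]*\([0-9A-Fa-f]\{4\}\).*/\1/p' | head -n 1
}

# Mount the ESP, create and activate an entry, then check it is first
# in BootOrder. $1 = ESP device (default /dev/nvd0p1), $2 = label.
add_nvme_entry() {
    esp_dev=${1:-/dev/nvd0p1}
    label=${2:-NVMe}
    mnt=/mnt

    mount_msdosfs "$esp_dev" "$mnt" || return 1
    efibootmgr -c -l "$mnt/efi/boot/BOOTx64.efi" -L "$label" >/dev/null
    num=$(efibootmgr | find_bootnum "$label")
    # Same activation form as used in the post: efibootmgr -a <bootnum>
    [ -n "$num" ] && efibootmgr -a "$num" >/dev/null
    umount "$mnt"

    if [ -n "$num" ] && [ "$(efibootmgr | first_in_bootorder)" = "$num" ]; then
        echo "Boot$num ($label) is first in BootOrder"
    else
        echo "no usable entry for $label; check 'efibootmgr -v'" >&2
        return 1
    fi
}
```

Calling add_nvme_entry with no arguments from the rescue shell would repeat the commands shown earlier; the reboot itself is left as a manual step so the 'efibootmgr -v' output can be checked first.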
-
@stephenw10 said in Netgate 8200 Max pfsense+ v22.05.1 local disk boot after install.:
Ok I was able to get this to work
What a legend! I have several more of these installs coming up this year so this'll save me so much time and back/forth with the facility teams so THANK YOU!! I'll give this a go on our next deploys.
-
@emmdee said in Netgate 8200 Max pfsense+ v22.05.1 local disk boot after install.:
After install, at least on the 7100 devices, I can always just load the boot menu and choose the internal disk, and it bypasses the USB as expected. This avoids having to get someone physically near the device to pull the USB for the initial boot.
I do realize the installer instructions say to physically remove the USB before rebooting; however, they say the same for the 7100, and choosing the disk from the boot menu works fine there. I'm trying to avoid physically removing the USB for each of these installs (I have several to do) because the devices are located around the globe. It can take days for an international support request to the local datacenter to get the drive pulled, then I "hope" it works, and they have to reinsert the drive when I'm done anyway (for ECL usage). The back & forth with the facility crews can take days depending on their location and facility SLA, and it costs my company money every time I open a remote hands request.
Screenshot hopefully gives a good indicator of what I'm seeing.
Here is the menu option after selecting < to show the path:
I can boot to USB with no issue and reinstall as needed. I've tried reinstalling several times and with different options.
……
I'm on a serial connection so it's a little difficult to catch all messages.
I'm just looking for some insight or ideas on what I could try or diagnose remotely over a serial console without giving on-site teams the runaround. I can hop into an installer shell chrooted….
Let's ask one question: how do you connect to the pfSense box itself in the data center:
- by SSH on a dedicated uplink;
OR - by "Special request for KVM access" to the DC crew for extra payment?
Because of issues like this, I ALWAYS prefer bare metal servers that have 2x PSU and a BMC (iRMC, etc.) on NON-DEDICATED (multiplexed) ETH.
And yes, 2 physical uplinks are better than 1 (even in the same DC, even if the link aggregator switch is above in the same rack).
Even TCO may decrease in this case because of fewer "Special Requests" to the DC crew.
But you ALWAYS have FULL server MANAGEMENT and MONITORING through SSL certs, and even REMOTE STORAGE MOUNTING.