SG-1100 Won’t Reboot on Upgrade - no internet access!
-
I thought I opened a TAC ticket late last night, but had left the form up so I could get the SN and other info from my box. So I filled that in and sent it in today - maybe an hour ago, maybe longer.
I'm back to trying to reach the servers. I've deactivated the LAN and am trying it over and over.
I'm wondering if there might be a reason why it only took a few retries in the early morning (US Eastern time) and during the day it's just not connecting.
Again, I see the LEDs flashing on the RJ45 and it doesn't complain about the NIC being inactive or anything.
This is the part where I wonder if a different IP address would help.
-
I had disabled the LAN and it couldn't reach the servers. Enabled it and it did, first try. Then I realized I forgot to put in the blank USB stick in the USB3.0 socket, so I had to go back and restart. Again left the LAN on and it went through first time. So it's formatting and preparing to install to the USB stick.
A thought on that: While I have a new SG1100 coming in next week, I'm wondering if, once I get it working on the USB stick, it would be easy to copy or clone that system to the main drive and see if it works on there.
Ah - it's fetching and stuff now. So I guess I can take a break and get one or two things done while it spends time doing that.
-
You had to assign it? Or it detected it?
You will have to set LAN as none or change its subnet in the installer to avoid a conflict there.
-
@stephenw10 said in SG-1100 Won’t Reboot on Upgrade - no internet access!:
You had to assign it? Or it detected it?
First I went through and specifically picked "None" or whatever the option was to not detect or use it. And it wouldn't connect to the servers.
Then I canceled and let the install restart. When it got there, I just hit <return> and let it keep the values. Then it connected to the server without a problem - two times. (I had to do it a 2nd time so I could plug in the USB stick I wanted to install it on.)
@stephenw10 said in SG-1100 Won’t Reboot on Upgrade - no internet access!:
You will have to set LAN as none or change its subnet in the installer to avoid a conflict there.
It seems to be working without that. It's got the LAN set up (as I said, I just hit <return>). But this is with the initial install at this point, where I can't touch the subnets.
Is this something Netgate should look into, since at least one ISP now is forcing a 192.168.1.xxx address space? Starlink is often a "last choice" when it's the only choice and they're all over the US and Canada now and I think in many other countries worldwide, so I would think this could become an issue.
-
Yes in retrospect pfSense should probably have used a different default subnet. The problem now is that it's been that for so long changing it would cause confusion at best.
But we are aware of the issue and you should be able to set it in any install situation you find.
-
@stephenw10 said in SG-1100 Won’t Reboot on Upgrade - no internet access!:
Yes in retrospect pfSense should probably have used a different default subnet. The problem now is that it's been that for so long changing it would cause confusion at best.
I get that - and there was no way of foretelling the arrogance of Starlink and their decision in what was, at the time, years in the future.
BUT
I wonder how hard it would be to check the WAN during install and, if the address it's been given is in that default subnet for pfSense, offering the user a choice to pick a different subnet to use. Also, this is during the install, not when it's configured, so, perhaps, switching to a different subnet just for the install, then switching back to the default at the end of the install might work. At that point, once it's done what it has to do during the install (or even during the post-reboot configuration), it could change back to the default subnet OR offer the user the choice.
@stephenw10 said in SG-1100 Won’t Reboot on Upgrade - no internet access!:
But we are aware of the issue and you should be able to set it in any install situation you find.
Well, somehow, it's working for me, but it's not something that's changeable during the install itself, which is when I've been having the problem.
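To make the install-time check I suggested above a bit more concrete, here is a rough Python sketch of the logic. This is only an illustration, not how the pfSense installer works: the function name and the fallback subnets are made up for the example, and the only thing taken from this thread is the 192.168.1.xxx default.

```python
import ipaddress

# The default LAN subnet discussed in this thread.
DEFAULT_LAN = ipaddress.ip_network("192.168.1.0/24")

# Hypothetical fallback subnets, chosen only for this example.
FALLBACK_LANS = [
    ipaddress.ip_network("192.168.2.0/24"),
    ipaddress.ip_network("10.10.10.0/24"),
]


def pick_lan_subnet(wan_address: str):
    """Return a LAN subnet that does not overlap the WAN's network.

    wan_address is the WAN address with its prefix length,
    for example "192.168.1.37/24".
    """
    wan_network = ipaddress.ip_interface(wan_address).network
    if not wan_network.overlaps(DEFAULT_LAN):
        # No conflict, so keep the default.
        return DEFAULT_LAN
    for candidate in FALLBACK_LANS:
        if not wan_network.overlaps(candidate):
            return candidate
    raise ValueError(f"no non-conflicting LAN subnet for WAN {wan_network}")


if __name__ == "__main__":
    print(pick_lan_subnet("192.168.1.37/24"))
    print(pick_lan_subnet("100.64.0.5/10"))
```

The idea is that the installer would only bother the user (or quietly switch subnets) when the overlap test is true, and would leave the default alone otherwise.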
-
Made it through all the past install crashes. It's still extracting packages, which is taking a good while, but it's working on a USB stick, which is going to be much slower than the internal drive. It's well past the boost-libs-1.85 package, which was the child process that was always killed. (That was package 53/177. It's on the last package now.)
-
I'd be happier if I had not already seen this and had to go back! (Still, it's progress.)
-
Even better!
It never asked me for any LAN info or to connect to the internet or anything like that. All the stuff I dealt with before is no issue now, at all. The only question it asked before this menu came up was to set my password.
And notice it's using the same address space on both NICs. Well, we'll change that when I upload my config through the web.
I have to leave for an appointment soon, so if I don't post an update, what I'm going to do now is to halt the system, disconnect the WAN connection, connect the LAN directly to my Mac, and upload my backed up configuration.
I'm surprised that if I had just left the LAN interface defined during setup, this time things would have gone smoothly. Yesterday there were so many wacky things that didn't make sense, it's like a dream. I had problems with serial connections, with it booting, and other issues I reported. I'm wondering if a lot of them could be explained by failing internal storage. That could have led to problems booting, reading settings and info, and maybe more.
-
Just this part to show the web config is up and running AND it's using my config info, with my LAN address space!
I was hoping to take it downstairs and plug it in where it belongs and quickly check to see if things are behaving, but I won't be able to now, since it has to finish reinstalling packages.
I've got an appointment and about 90 minutes of driving coming up (to get there and back), so that'll give me time to mentally review everything. I want to list the issues that came up and see if it's possible to figure out why they weren't a problem today when they were so hard to deal with yesterday.
-
Well looking much better at least!
-
Short version: It's all working fine now.
Also, I want to thank everyone who chipped in and helped along the way. Some people spent a lot of time on this thread writing helpful comments and suggestions and reading through a lot of my details to help me work things out. That is deeply appreciated!
This may be long, but I'm trying to be thorough. I think pfSense is a strong program and I've been using it for so long I don't remember when I first used it. I want to say it was somewhere around 2005, but I'm not sure. First I used the open source version on a Soekris net5501 (if that's the right model number) for many years, then switched to the SG1100 5 years ago. If any of this rotten experience helps improve any part of the program, I'll be glad it helps.
Yesterday (Thursday) was horrendous and I feel like I went through a worst-case restore situation. The only two ways it could have been worse are if my SG1100 had completely crashed and I didn't have a config backup, or if I had not backed up the config before it went bad. I think it's important to stress that it did not just crash on its own. It crashed as I was upgrading, not when I tried something tricky or experimental.
I've been thinking through what happened and what could have made it easier for me to restore my device to functionality. I can see why there is no way to do a factory reset. I do wish that the boot firmware had ssh included so there would be a way to connect without a serial cable, but I understand there are probably reasons against that.
So that leaves my experience and the things that went wrong (or the things that were good). I think, since so much went wrong, this is a good case study for Netgate, since almost anything that could go wrong went wrong.
Issues I faced that can likely be fixed:
- With usbrestore, I ran into a problem when the device was being deblocked (unblocked? not sure of the term or the exact message. I think it's in a screenshot or comment upthread). The message is not as clear as it could be and says something like, "Device being deblocked." It takes time, so it's hard to tell if it's doing something or if it just got hung up. As a user, this is confusing. Is it hung up or working in the background? Pressing any button apparently (in my experience) terminates the process, leaving the device in an unknown condition. From there it's not possible to proceed without running it again and facing the same issue. Suggested fix: Add a line at the start saying, "(This could take several minutes.)" Since Netgate knows what devices they have and that this will run on, it might even be possible to include a time, like, "This may take up to 5 minutes." Another way to handle it is to include a counter on how many blocks have been zeroed out or some other status information that changes while the program is running. The purpose is to let the user know the program didn't hang, it's just busy. Also, if the user presses a button, rather than quit, the program could prompt with something like, "Still unblocking device. Abort? (y/N):"
I think this issue (including the deblocking aborting on a keypress) added several hours to my restore process.
- Default address space on LAN can conflict with WAN. In my case, with Starlink as an ISP, I have no way to change the address space on the WAN side. It's far less than optimal for Starlink to force the 192.168.1.xxx address space, but they do. It's the same address space pfSense defaults to. @stephenw10 and I have both expressed concerns about this. However, when things finally worked well, it didn't seem to be an issue. (Oddly, it was a nightmare on Thursday, but when I tried doing things again on Friday, it wasn't an issue - and I cannot figure out what was different. On Friday, when things worked, at at least two points pfSense reported the addresses for both the WAN and LAN, and they were both in the same address space. How that worked at all is beyond me.) Suggested fix: Before connecting to the Netgate servers, check the WAN IP address. If it's in the 192.168.1.xxx address range, give the user a choice of using a different address space on the LAN. It may be better, though, to not give a choice. Since the LAN connection is not used at all until after the reboot, if the WAN is using this address space, either automatically deactivate the LAN NIC or change to another address space. Then, before rebooting, change back to the default address space. This way there is no conflict and the user doesn't have to deal with the issue at all.
I don't know what went on and why it took hours and dozens of tries to connect to the Netgate servers, but this issue probably added 4-5 hours to my restore process. (And I'm not exaggerating on the timing!)
The last time I tried the install, almost everything was perfect - what wasn't has been addressed in the issues I list below. But for some reason, it just would NOT work properly at all for hours and hours when I first tried everything.
- I had an install of v14.11 and v14.03 both crash at the same point (discussed upthread). Basically it was during expansion of downloaded packages. I suspect it was a driver failure. I don't know where the packages are downloaded to and where they are unpacked. Just in case, I tried a hack and made my own USB restore drive by formatting a 256GB USB stick in FAT32, then copying the files from a net install image on a USB stick onto my stick. That provided a lot of free space. It may have been coincidence, but the failure message (about a background process having to be killed) was no help. My guess was that it was a storage problem and, again, maybe the sign of a failing drive. (How long do the drives in an SG1100 tend to last? Should I just plan on replacing these units every so many years?) During the download and unpacking, it would be nice, if things fail, if the drive space was checked and reported in any abort or error messages to indicate if the problem could be storage.
- Restoring with my configuration file led to multiple issues. (Discussed upthread.) There were reports from the post-install configuration (after reboot) about 2 interfaces that, apparently, are not even on the SG1100. I'm not sure what was going on here, but my configuration file was the one I downloaded from the SG1100 before I started my failed upgrade and it was loaded from the web interface and installed without issue. I have every reason to believe it's a good configuration file, but the issues that showed up because of me trying to restore with it cost me 2-3 hours. When I tried the same steps, but without the configuration file, I finally got a good restore. (It just occurred to me that I use Tailscale. I don't know if that creates virtual devices, but maybe that could be part of what caused this.)
- Include a section in the docs about restoring to a USB stick. I finally decided to do that and things worked perfectly. That may be the one thing that fixed all the other problems. While the only 2 extra steps (setting the env options) are simple, it would be worth it to have a section on that in the docs. This post says a lot of what's needed, but having it in the docs would prove helpful to a lot of people.
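As a rough illustration of the suggestion above about reporting drive space when unpacking fails, here is a small Python sketch. It is not taken from the pfSense installer; the function name, the message wording, and the 512 MiB threshold are all assumptions made up for the example.

```python
import shutil

MIB = 1024 * 1024
# Hypothetical threshold below which low space is flagged as a suspect.
MIN_FREE_BYTES = 512 * MIB


def unpack_error_message(package: str, target_dir: str,
                         min_free_bytes: int = MIN_FREE_BYTES) -> str:
    """Build an error message that includes free space on the target drive."""
    usage = shutil.disk_usage(target_dir)
    message = (
        f"Failed to extract {package}. "
        f"{target_dir}: {usage.free // MIB} MiB free of {usage.total // MIB} MiB."
    )
    if usage.free < min_free_bytes:
        # Only a hint; the real cause could still be something else.
        message += " Low free space may be the cause."
    return message


if __name__ == "__main__":
    print(unpack_error_message("boost-libs-1.85", "."))
```

Even just the "free of total" part in the abort message would have told me right away whether storage was worth suspecting.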
Intermittent or Tricky Issues that May Not be Debuggable
- I've mentioned several times that I think the issue could be that my drive is failing. That would explain several issues, maybe even explain all of them. I think it would be quite useful to add an option to the boot menus to make it easy to run fsck to verify the system storage devices. Maybe even add a prompt when the Marvell shell comes up about what command can be run to verify device integrity. This would be incredibly helpful, since most of us may not be familiar with just what the underlying OS is (yes, it's BSD, but a lot of users are inexperienced with BSD and don't know what tools are available on it). Some kind of prompting or including a menu item to verify drive integrity would be a major help.
- Flakey serial connections: Connecting with the serial console is easy on Mac and Linux, just screen <device node> 115200. It's almost foolproof, but at one point, when I brought the device up to my study to make it more comfortable to work on it, I tried booting over and over and often saw gibberish on the serial connection. There were times it was a matter of characters not showing up, so I might get something like "eger" instead of "Netgear." I also had times when the serial connection failed altogether. This could have been because of my cable, but I also noticed that there were a number of times that adjusting the cable connection on the SG1100 fixed things. I think the position of the USB-B microconnector on the motherboard and the thickness of the case may create an issue where the ends of some USB cables aren't long enough to fit into the connector well. Making a case with an indentation around the connector or putting it just a little bit closer to the edge of the PCB might fix this. I have no idea, though, what caused the flakey serial connection most of the time.
- Boot issues: I had multiple times when I tried to boot and got a message that it was trying to boot and I saw a "T" (or, if I remember, sometimes an asterisk) on the serial console, then a space, and, after a wait, another T (or asterisk). I don't know what was going on, but this took time and never was followed by a proper boot. At one point I was having serious trouble for a while getting it to boot and provide a non-garbled serial connection. I think the boot issue could be explained by a failing storage device, but I don't know if the serial connection issue could be.
I've spent a lot of time thinking this through and I hope it helps the devs at Netgate.
-
@TangoOversway there is this?
https://docs.netgate.com/pfsense/en/latest/backup/restore-during-install.html#restore-configuration-from-media-during-install
There are various threads about eMMC storage. TL, DR:
https://docs.netgate.com/pfsense/en/latest/troubleshooting/disk-lifetime.html#emmc
Packages that require/recommend SSD:
https://www.netgate.com/supported-pfsense-plus-packages
-
@TangoOversway said in SG-1100 Won’t Reboot on Upgrade - no internet access!:
With usbrestore, I ran into a problem when the device was being deblocked (unblocked? not sure of the term or the exact message. I think it's in a screenshot or comment upthread). The message is not as clear as it could be and says something like, "Device being deblocked." It takes time, so it's hard to tell if it's doing something or if it just got hung up. As a user, this is confusing.
Do you mean where you reported seeing?:
mountroot> random: unblocking device.
And did it just repeat random: unblocking device for some time?
That shouldn't happen in a normal boot. It may have to wait a second or two for the root drive to become available but that's it. If you see the mountroot> prompt that means it's tried to mount root and failed leaving you at the prompt. Some other background process is spamming the unblocking device message but effectively it is still waiting for input at the prompt.
So if you see that it's usually not possible to continue without manually mounting root which the user is never expected to do.
I would guess that was a USB drive it was having problems with.
-
@SteveITS said in SG-1100 Won’t Reboot on Upgrade - no internet access!:
@TangoOversway there is this?
https://docs.netgate.com/pfsense/en/latest/backup/restore-during-install.html#restore-configuration-from-media-during-install
I was using that. I had moved my config file to my restore USB and imported it during the install process. I made one mistake, since I assumed it would see my backup config file, and didn't realize it had to be specifically named "config.xml". The filename pattern the backup feature uses will not be recognized. But my issue came later, after install and reboot. It had issues with interfaces that, apparently, shouldn't have been there (and weren't in my config file). So for some reason, it had a problem with a legitimate config file that came from a pfSense backup and was later used successfully to restore my new install to my configuration.
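For anyone else who trips over the naming, this is roughly how I'd script the copy next time. It's a minimal Python sketch: the function name is made up, and putting the file at the top level of the stick is my assumption, so check the linked docs for where the installer actually looks.

```python
import shutil
from pathlib import Path


def stage_config(backup_file: str, usb_root: str) -> Path:
    """Copy a downloaded backup onto the USB stick under the name config.xml.

    backup_file: the backup as downloaded from the web UI (any filename).
    usb_root: where the USB stick is mounted; the top-level location is
              an assumption for this example.
    """
    source = Path(backup_file)
    if not source.is_file():
        raise FileNotFoundError(f"backup not found: {source}")
    destination = Path(usb_root) / "config.xml"
    # The original backup is left untouched; only the copy is renamed.
    shutil.copyfile(source, destination)
    return destination
```

Calling it with the downloaded backup and the stick's mount point leaves the original file alone and puts a copy on the stick with the name the installer expects.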
@SteveITS said in SG-1100 Won’t Reboot on Upgrade - no internet access!:
There are various threads about eMMC storage. TL, DR:
https://docs.netgate.com/pfsense/en/latest/troubleshooting/disk-lifetime.html#emmc
Packages that require/recommend SSD:
https://www.netgate.com/supported-pfsense-plus-packages
Come to think of it, it didn't occur to me that some packages won't work without the SSD. I do see that my device is notably slower in booting and in the web UI with it running from the USB stick, but I expected that, since a USB is always slower. (What didn't occur to me, and that I've never tested, is to use a microSD card on a USB adaptor and see if that's any faster than a USB stick. I don't know if it's a USB bottleneck or if the issue is a different type of memory in the device.)
@stephenw10 said in SG-1100 Won’t Reboot on Upgrade - no internet access!:
Do you mean where you reported seeing?: mountroot> random: unblocking device.
Yes. For the user, there's no indication that it can take time to do the unblocking, so it looks like it might have just frozen. And, as I mentioned, in my experience, a single keypress stopped the command. If the keypress didn't stop the command, then the command encountered an error and did not report it until my keypress. Either way, I, as the user, was not well informed on what was happening in that situation.
@stephenw10 said in SG-1100 Won’t Reboot on Upgrade - no internet access!:
And did it just repeat random: unblocking device for some time?
It didn't do it one time after another during the same attempt, but in repeated reboots. I was thinking there were other issues, but your next comment I quote, below, responds to some of that.
@stephenw10 said in SG-1100 Won’t Reboot on Upgrade - no internet access!:
That shouldn't happen in a normal boot. It may have to wait a second or two for the root drive to become available but that's it. If you see the mountroot> prompt that means it's tried to mount root and failed leaving you at the prompt. Some other background process is spamming the unblocking device message but effectively it is still waiting for input at the prompt.
So mountroot> is a prompt - but the random: unblocking device is not. Okay, that was confusing, but it explains why a keypress seemed to interrupt what I thought was an unblocking process. I had worked out, in my head, when I kept seeing that over and over, what I thought was going on. So now I'm trying to remember what happened with this new information in mind. It sounds like there already was an error or just hitting <return> (which I often did at that point) generated an error at that prompt. I forgot what I'd get, but it might be in one of my screenshots.
@stephenw10 said in SG-1100 Won’t Reboot on Upgrade - no internet access!:
I would guess that was a USB drive it was having problems with.
That was, if I remember correctly, after I would type run usbrecovery. That seems to mean that it was either having trouble using the files on the USB stick or had already started using them. And if it says mountroot>, does that mean it did some kind of pivot_root to change the root to the USB stick?
-
Separate from my responses already posted:
- Now that I got it to restore properly, and another SG1100 is due to arrive Monday, I'd like to set this one up to use as a backup if the other fails.
- Since I finally got it working properly, I would think I could power down, pull the USB stick it's running from now, and see if I can get it to restore to the internal SSD. As I mentioned many times, I am wondering if that drive may have gone bad. Is there anything in this thread that indicates that it's likely the internal storage is bad? (Also, if I can restore pfSense on the internal storage, it simplifies storage. I don't have to worry about the USB stick becoming separated from the SG1100 or being broken if something on the shelf falls on it and pushes on the USB stick and messes up the connector or the stick itself.)
- Home Assistant has a nifty add-on that will back up the configuration to a Samba share on the local network. That's a nice safety feature, since it means if the HA system is borked, all that's needed is to just set up a new system and reload the config from the Samba share. Is there anything like that with pfSense, where it will automatically save a config file to a NAS or other external storage regularly?
-
@TangoOversway The SG-1100 doesn't have an SSD; it's eMMC, isn't it?
https://shop.netgate.com/products/1100-pfsense
Storage: 8GB eMMC storage.
-
One other MAJOR issue:
I can see this as a major security issue and I was so focused on just getting things working, I kept forgetting to bring it up.
When I was trying to get my firewall back online, there were times when it was connected to my LAN and the WAN, as normal. This was while I was trying to connect to the Netgate servers. The Starlink router was aware of devices on my LAN! So during setup, there was a direct network pass-through on my SG1100! On the Starlink mobile app, it's possible to connect with my router, even remotely (not through wifi, but through my router's communication with Starlink ISP). When I did this, it listed all my smart TVs, my desktop computers, my Home Assistant systems, my Sonos speakers, and a number of other devices on my LAN.
I don't think this is as much a threat in my situation, since I still had that router between my LAN and the internet and most Starlink users will use that router as their only firewall. Also Starlink uses CGNAT, so unless something on my LAN is phoning home for malware (which it might be able to do with pfSense anyway), it's not like someone could penetrate through the CGNAT. But if my firewall was the only safety device between my LAN and the internet, it would have been a security nightmare.
-
@TangoOversway said in SG-1100 Won’t Reboot on Upgrade - no internet access!:
aware of devices on my LAN! So during setup, there was a direct network pass-through on my SG1100!
That's not normally going to be possible unless you had a bridge set up, or connected your Starlink to a LAN port and the rest of the network to another LAN port.
But I'm not sure of all the things you did during your setup of pfSense. Those ports are all part of the same switch in the 1100, I do believe, not discrete interfaces. So it's possible you had two ports in the same L2, and then yeah, devices on one port would be able to "see" devices connected to the other port.
So sure, during your setup it's possible all those ports were in the same L2.
-
automatically save a config file
There is this: https://docs.netgate.com/pfsense/en/latest/backup/autoconfigbackup.html
Re it being a switch if unconfigured, that’s specific to the 1100. Didn’t look but perhaps Netgate could add a note to the reinstall directions to say to consider disconnecting from LAN and OPT during reinstall if it’s not there.