Issue with SG-3100 and 22.01? [Solved]
-
Hi,
I am getting a bit nervous today. I updated my SG-3100 on February 19th, 2022 to 22.01.
Previously it ran 21.05.2 for months without any issue.
On March 14th I noticed that my internet connection was gone. The SG-3100 answered pings, but was not reachable via GUI or SSH. Only pulling the power plug restarted the box.
Today, April 7th, the same issue occurred again. Ping was answered, but no web GUI or SSH access was possible.
Because I needed to get back online fast (working from home), I restarted the box again by pulling the power plug.
But now I am concerned that this is a periodic issue. Until yesterday there was nothing unusual with resources or any other value, as far as I could see.
Next I will connect an older laptop to the serial console and run a terminal program to catch anything (if any).
Anyhow, this issue did not occur with 21.05.2, so has anyone else seen something like this?
After the reboot, the crash report contains only a single line, without any useful information (my guess):
[07-Apr-2022 06:53:00 Europe/Berlin] PHP Fatal error: Unable to start pfSense module in Unknown on line 0
I noticed the issue at 08:10, so it seems the crash happened a bit more than an hour earlier.
Thanks for any idea how to pinpoint the issue.
Regards
-
Nothing else logged?
Yes, I would connect to the serial console and see what shows there. Try restarting PHP from the console menu.
Steve
-
Yes, nothing else, this was all.
And now I know for sure that the system crashed at about 06:53 local time.
I am running periodic updates to a weather site; these stopped at 06:53 and resumed after my reboot at 08:15.
At that time no one was accessing pfSense, so I am wondering why PHP crashed.
Regards
-
With weeks in between incidents like that it could be something like RAM or disk exhaustion.
Check the graphs in Status > Monitoring. Do they show continuously increasing resource use anywhere?
-
I will keep an eye on this page from now on.
Until now I always checked the usage in the dashboard, but it was always low and did not increase unusually.
Regards
-
The monitoring graphs log that over time, so they should still show anything like that if it happened.
-
Oops, I just saw that it stores data over time. The gap from this morning, when the services stopped, is clearly visible:
But usage is constant the whole time!?
Regards
Edit: And here is the first drop in March:
-
So no big CPU usage before it stopped logging. You can check other resources by changing the graph settings. I would definitely check memory.
Steve
-
Yes, it seems that free memory is slowly but steadily decreasing over the days!
But if I read it correctly, there was more than 65% free this morning!?
From today:
And in total since running on 22.01:
Regards
-
Hmm, still mostly free though. That wouldn't stop it responding.
-
Is this the unit where you updated WireGuard and ran into the wrong-branch issue?
If so, take a backup and do a clean reinstall.
-
Hi, I have the exact same problem with a 2100. I did a clean install of 2.6 about a month ago, and this problem has happened twice since then. I had no issues with the device before.
The GUI becomes inaccessible, SSH unreachable, cron jobs stop running etc., but routing still works to some extent (I can access my VPN server behind pfSense remotely).
I haven't been able to find anything relevant in the logs, so any pointers as to where to look would be appreciated.
I use pfSense at an SMB so this is an absolute deal breaker for me unfortunately.
-
There is no 2.6 image for the 2100, which is aarch64, I assume you mean 22.01?
Some services remaining up whilst others fail is typical of something like RAM exhaustion though so check the same things. It can also be caused by a failing drive which then prevents any errors being logged.
Steve
-
@stephenw10: Yes, I meant 22.01, sorry. I don't see anything unusual with RAM usage. Is there anything in S.M.A.R.T. data that would indicate that the M.2 SSD is failing? Also, is there an M.2 SATA SSD (2242) that you'd "officially" recommend?
-
None other than the one that was already fitted, if it has one.
I would expect some errors in the SMART data if it is failing.
I might expect to see other errors logged also.
-
@stephenw10: It didn't come with an M.2 SSD, I installed one. Do you know what errors/indicators I should be looking for, specifically?
-
Not anything specific. Any errors are bad!
Try running the short test and check the results.
Logging the console output is usually the best way to diagnose a drive failure, if you can, since the system will often dump error output there that cannot be written to the system log.
Steve
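One way to make a captured console log easier to line up with an outage is to timestamp every line as it arrives. The following is only a minimal sketch, not a Netgate-documented procedure: the function name `stamp_lines`, the device path `/dev/ttyUSB0`, the log file name and the 115200 baud setting in the usage comment are assumptions to adapt to your own logging host.

```shell
# Prefix every line read from stdin with the local time it was received.
stamp_lines() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date '+%Y-%m-%dT%H:%M:%S')" "$line"
    done
}

# Hypothetical usage on a Linux logging host (device, speed and file name are assumptions):
#   stty -F /dev/ttyUSB0 115200 raw -echo
#   cat /dev/ttyUSB0 | stamp_lines >> console.log
```

With timestamps in the file, a long silence before a lockup can be told apart from a capture that stopped working.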
-
@stephenw10: I've run a short test, but it doesn't show any errors. I'll reinstall to eMMC. I really hope it's the SSD and not a software issue.
-
My device has no SSD, only eMMC. Hopefully it's not the hardware.
In the next few days I will set up my standby device from scratch and replace this unit the next time the issue occurs.
Regards
-
I would also consider re-installing 22.01 clean and restoring your config to rule out any issues during the upgrade.
Steve
-
My 3100 booted once (if I recall correctly) after the 22.01 update, back in February. Since then, I have been forced to set the unit aside. The update completed fine, but afterwards I cannot access it via serial or the web GUI.
I've tried a hard reset via the reset button, with no change. Yet the boot process seems to complete, since the LED panel indicators seem to proceed in a valid sequence.
I cannot access it by any means.
Is there a factory service?
I've opened a TAC-Lite ticket and am just surfing around the forum until a reply is received.
-
You should be able to see the boot output from uboot at the serial console even if there is no OS installed, so I would concentrate your efforts on getting that working. Once you can see the console it will probably be obvious what is preventing you from connecting to the webgui.
https://docs.netgate.com/pfsense/en/latest/solutions/sg-3100/connect-to-console.html
Steve
-
@stephenw10 Appreciate the reply:
Closer inspection revealed the serial connection issue to be on the laptop side only.
The 3100 has been successfully upgraded to the latest 22.01 and I left it unattended while packages were being loaded, even though the WAN port reported 'no carrier' / 'DHCP down'. This is clearly isolated to the 3100, as moving the WAN cable to another outside-facing device works perfectly.
But thinking that the DNS Resolver just needed some time to come up, I assumed that time would cure all.
This is not the case. After 3 hours, the Resolver is up but there is still no WAN link. And the Gateway Monitoring service will not start, remaining stopped.
Thoughts? / Help!!!
-
You have already opened a TAC ticket.
Ask for the recovery image, usually you will get a download link within 1 hour (during normal office hours).
Then you can install the SG from scratch.
Regards
-
Do you see link LEDs on the WAN port?
What does Status > Interfaces show?
It should link and show UP if it's connected to anything unless it's disabled.
Steve
-
I've always used recovery images for firmware upgrade and that was the case in the Feb time frame. I save the running config as a start, initiate a factory default load, run recovery and load the latest firmware, then restore the saved configuration. That process is the standard procedure since initial ownership.
As I said, I shelved the 3100 until a time when I could invest in the problem.
Status > Interfaces > WAN indicates 'no carrier' / 'DHCP down'. And DHCP will not come up.
Activity light on the port is green/solid
-
@gherkin-d said in Issue with SG-3100 and 22.01?:
Activity light on the port is green/solid
Even with no cable attached? That would be a bad port if so.
Steve
-
@stephenw10
That's with cabling!!
BTW, firmware upgrades via the recovery image are done with the WAN cable attached.
-
So the WAN port just shows the left LED solid green when you connect a cable to it? Yet it shows no link in the status?
What does
ifconfig -vvvm mvneta2
show?
Is it possible WAN is configured to use one of the other ports?
Steve
-
@gherkin-d mvneta2 is the WAN port :) mvneta1 is LAN.
-
Yeah the default config assigned mvneta0 as OPT so that might be expected to show as no carrier.
-
I have not reinstalled the SG-3100 from the recovery image yet; I am curious whether anything gets logged at the console.
I expect the next crash on May 1st or 2nd.
Up to now the SG has crashed every 23/24 days.
A PC is connected to the serial interface.
But anyhow, I have two more questions. First, as the screenshot shows, something is creeping along that decreases free memory. Any idea how to pinpoint this?
Second question: I noticed that when using this view I need to log in to pfSense again after x hours (not sure about the exact value).
When in the dashboard view I stay logged in for days!?
It seems that the session duration depends on the view, is that correct?
Regards
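The 23/24-day spacing can be checked against the dates given earlier in the thread (update on February 19th, lockups on March 14th and April 7th). This is a small sketch using the standard Julian Day Number formula in plain shell arithmetic, so it does not depend on GNU or BSD `date` options. The function names `jdn` and `days_between` are made up for illustration.

```shell
# Julian Day Number for a Gregorian date given as: year month day
# (pass plain integers without leading zeros, e.g. 2022 3 14).
jdn() {
    y=$1; m=$2; d=$3
    a=$(( (14 - m) / 12 ))
    yy=$(( y + 4800 - a ))
    mm=$(( m + 12 * a - 3 ))
    echo $(( d + (153 * mm + 2) / 5 + 365 * yy + yy / 4 - yy / 100 + yy / 400 - 32045 ))
}

# Whole days from the first date (args 1-3) to the second date (args 4-6).
days_between() {
    echo $(( $(jdn "$4" "$5" "$6") - $(jdn "$1" "$2" "$3") ))
}

days_between 2022 2 19 2022 3 14   # update to first lockup
days_between 2022 3 14 2022 4 7    # first lockup to second lockup
```

The two calls print 23 and 24. Another 24 days after April 7th lands on May 1st, which is where the estimate above comes from.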
-
Check what processes are using RAM in Diag > System Activity or run
top -aSH
at the CLI.
The dashboard has a number of active items on it that update periodically, keeping the session open. On static pages the session times out after a while.
Steve
-
Other commands you can use to show the top memory-consuming processes:
Show the top 10 memory consumers:
# ps -Am -opid,pmem,pcpu,rss,vsz,args | sort -k4 -rn | awk 'BEGIN { print " PID %MEM %CPU RSS VSZ COMMAND" } NR <= 10 { print $0 }'
If you want a continuously updating display:
# while :; do clear; ps -Am -opid,pmem,pcpu,rss,vsz,args | sort -k4 -rn | awk 'BEGIN { print " PID %MEM %CPU RSS VSZ COMMAND" } NR <= 10 { print $0 }'; sleep 1; done
Or use top:
# top -o res
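Because the decrease in free memory plays out over weeks, it may help to save a snapshot of the largest processes at intervals and compare them later. The following is a rough sketch along the lines of the commands above, not a tested pfSense recipe: the function names `top_rss` and `snapshot` and the log path `/root/mem-snapshots.log` are assumptions.

```shell
# Print the N largest rows by RSS (column 4) from ps-style input on stdin.
# Assumes the columns PID %MEM %CPU RSS VSZ COMMAND with one header row.
top_rss() {
    n="${1:-10}"
    # Drop the header row, sort numerically on RSS (descending), keep N rows.
    awk 'NR > 1' | sort -k4,4 -rn | head -n "$n"
}

# Append one timestamped snapshot to a log file (default path is an assumption).
snapshot() {
    log="${1:-/root/mem-snapshots.log}"
    {
        date '+%Y-%m-%d %H:%M:%S'
        ps -A -o pid,pmem,pcpu,rss,vsz,args | top_rss 10
        echo
    } >> "$log"
}
```

Calling `snapshot` once a day or so would be enough to see which process, if any, keeps growing between lockups.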
-
This morning it happened again...
From one minute to the next, I noticed that accessing a web site was no longer possible (404 - Not found).
Ping worked to all IP addresses (e.g. 1.1.1.1 or 8.8.8.8).
But pinging a hostname did not work, so the DNS service was not responding.
After a few minutes the WebGUI of the SG-3100 was unreachable too.
After the last incident I had connected an old laptop to the SG-3100 via the serial port, so I looked for any console output and ... the laptop had locked up as well!?
Very strange... I had to reboot the laptop, and after checking the PuTTY log I rebooted the SG-3100 by power cycling. There was nothing in the log, and nothing really means nothing since the last reboot! Not a single character.
An existing VPN connection into the company network was still working (until the reboot of the SG).
No idea what this is. I will switch to the replacement hardware next weekend.
Regards
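Since the symptoms differed per service (ping answered, DNS and the WebGUI did not), a probe running on another LAN host could record when each one stops responding. This is only a rough sketch under stated assumptions: the function names `check` and `probe`, the gateway address `192.168.1.1`, port 443 for the GUI and the lookup name `example.com` are placeholders, and the exact `ping`, `nslookup` and `nc` options available depend on the host doing the probing.

```shell
# Run a command quietly and print "up" if it succeeded, "down" otherwise.
check() {
    if "$@" >/dev/null 2>&1; then echo up; else echo down; fi
}

# Print one timestamped status line for the firewall at address $1.
probe() {
    gw="$1"
    printf '%s ping=%s dns=%s gui=%s\n' \
        "$(date '+%Y-%m-%dT%H:%M:%S')" \
        "$(check ping -c 1 "$gw")" \
        "$(check nslookup example.com "$gw")" \
        "$(check nc -z -w 3 "$gw" 443)"
}

# Hypothetical usage, one line per minute appended to a file:
#   while :; do probe 192.168.1.1 >> probe.log; sleep 60; done
```

The first line where `dns` or `gui` flips to `down` gives a timestamp to compare with the gap in the monitoring graphs.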
-
Hmm, nothing appeared at the console at all?
At the very least you should have seen webgui logins shown there. If nothing at all was shown, it sounds like it was not logging.
Your description of the issue really 'feels' like a failing drive. That's exactly how it presents. Except that eMMC failures generally don't recover at power cycle.
If you have an M.2 SATA SSD you could try that instead.
Steve
-
Hi,
No M.2 SSD yet, but I will get one in the next few days.
And no, I was surprised too. Over the last weeks I looked at the serial console from time to time and nothing was shown there. Sometimes I just pressed the ENTER key to check whether I was still connected to the console. I was, but nothing had been recorded since startup.
Regards
-
Hmm, well try just logging into the webgui to check. That should be shown, for example:
*** Welcome to Netgate pfSense Plus 22.05-BETA (arm) on 3100 ***

 WAN (wan)       -> mvneta2    -> v4/DHCP4: 192.168.126.11/24
 LAN (lan)       -> mvneta1    -> v4: 192.168.18.1/24
 OPT1 (opt1)     -> mvneta0    -> v4/DHCP4: 192.168.21.10/24

 0) Logout (SSH only)                  9) pfTop
 1) Assign Interfaces                 10) Filter Logs
 2) Set interface(s) IP address       11) Restart webConfigurator
 3) Reset webConfigurator password    12) PHP shell + Netgate pfSense Plus tools
 4) Reset to factory defaults         13) Update from console
 5) Reboot system                     14) Disable Secure Shell (sshd)
 6) Halt system                       15) Restore recent configuration
 7) Ping host                         16) Restart PHP-FPM
 8) Shell

Enter an option:
Message from syslogd@3100 at Jun 3 16:43:12 ...
php-fpm[656]: /index.php: Successful login for user 'admin' from: 172.21.16.5 (Local Database)
Steve
-
I have the same problem on my SG-3100 running 22.01. Roughly every 4-5 weeks the device freezes. I can ping the device/gateway address but no traffic goes through the WAN interface. The DNS resolver, web GUI etc. do not work or are unreachable.
I have a Raspberry Pi connected to it via the USB serial console, recording all the output to a text file using GNU screen. Nothing is recorded when this happens. Nothing in particular is shown in the boot log either. My knowledge of all the log files is limited, however, so I might be missing something.
I ran S.M.A.R.T. tests on my M.2 SATA drive and no errors are shown. I see a memory graph similar to the previous poster's.
Any ideas or suggestion what to check next would be greatly appreciated.
Should I perform a re-install and restore config from backup maybe?
Open a support ticket with Netgate?