Router hanging after 21.05 upgrade
-
Morning
I upgraded my Netgate SG-1000 to 21.05 last week, and since then it has been hanging and effectively crashing. I've had to power cycle it about 4 times.
It lasts about 48 hours, then I notice the web interface hangs and I can't ssh to it. BUT packets are still being routed and filtered.
About another 24 hours later, it hangs entirely and stops passing packets.
Any ideas?
Thanks.
The last few lines of system.log just prior to the last crash (this morning):
Jul 6 07:27:03 home nginx: 2021/07/06 07:27:03 [error] 89464#100108: *1520 upstream timed out (60: Operation timed out) while reading response header from upstream, client: 10.0.0.158, server: , request: "POST /widgets/widgets/interface_statistics.widget.php HTTP/2.0", upstream: "fastcgi://unix:/var/run/php-fpm.socket", host: "10.0.0.1", referrer: "https://10.0.0.1/index.php"
Jul 6 07:32:23 home nginx: 2021/07/06 07:32:23 [error] 89464#100108: *1522 upstream timed out (60: Operation timed out) while reading response header from upstream, client: 10.0.0.158, server: , request: "POST /widgets/widgets/dyn_dns_status.widget.php HTTP/2.0", upstream: "fastcgi://unix:/var/run/php-fpm.socket", host: "10.0.0.1", referrer: "https://10.0.0.1/index.php"
Jul 6 07:37:43 home nginx: 2021/07/06 07:37:43 [error] 89464#100108: *1524 upstream timed out (60: Operation timed out) while reading response header from upstream, client: 10.0.0.158, server: , request: "POST /getstats.php HTTP/2.0", upstream: "fastcgi://unix:/var/run/php-fpm.socket", host: "10.0.0.1", referrer: "https://10.0.0.1/index.php"
Jul 6 07:57:19 home nginx: 2021/07/06 07:57:19 [error] 89464#100108: *1526 upstream timed out (60: Operation timed out) while reading response header from upstream, client: 10.0.0.158, server: , request: "GET /index.php HTTP/2.0", upstream: "fastcgi://unix:/var/run/php-fpm.socket", host: "10.0.0.1", referrer: "https://10.0.0.1/status_interfaces.ph
I rebooted the router before it stopped passing packets.
The log from the previous hang, where it stopped passing packets and needed to be rebooted:
Jul 3 08:33:48 home ppp[13666]: [wan_link0] Link: reconnection attempt 692
Jul 3 08:33:48 home ppp[13666]: [wan_link0] PPPoE: Connecting to 'Tangerine'
Jul 3 08:33:57 home ppp[13666]: [wan_link0] PPPoE connection timeout after 9 seconds
Jul 3 08:33:57 home ppp[13666]: [wan_link0] Link: DOWN event
Jul 3 08:33:57 home ppp[13666]: [wan_link0] LCP: Down event
Jul 3 08:33:57 home ppp[13666]: [wan_link0] Link: reconnection attempt 693 in 4 seconds
Jul 3 08:34:01 home ppp[13666]: [wan_link0] Link: reconnection attempt 693
Jul 3 08:34:01 home ppp[13666]: [wan_link0] PPPoE: Connecting to 'Tangerine'
Jul 3 08:34:10 home ppp[13666]: [wan_link0] PPPoE connection timeout after 9 seconds
Jul 3 08:34:10 home ppp[13666]: [wan_link0] Link: DOWN event
Jul 3 08:34:10 home ppp[13666]: [wan_link0] LCP: Down event
Jul 3 08:34:10 home ppp[13666]: [wan_link0] Link: reconnection attempt 694 in 4 seconds
Jul 3 08:34:14 home ppp[13666]: [wan_link0] Link: reconnection attempt 694
Jul 3 08:34:14 home ppp[13666]: [wan_link0] PPPoE: Connecting to 'Tangerine'
Jul 3 08:34:23 home ppp[13666]: [wan_link0] PPPoE connection timeout after 9 seconds
Jul 3 08:34:23 home ppp[13666]: [wan_link0] Link: DOWN event
Jul 3 08:34:23 home ppp[13666]: [wan_link0] LCP: Down event
Jul 3 08:34:23 home ppp[13666]: [wan_link0] Link: reconnection attempt 695 in 1 seconds
Jul 3 08:34:24 home ppp[13666]: [wan_link0] Link: reconnection attempt 695
Jul 3 08:34:24 home ppp[13666]: [wan_link0] PPPoE: Connecting to 'Tangerine'
I have THOUSANDS of these messages.
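With that many entries, a summary may be easier to share than the raw log. Below is a rough Python sketch that reports the highest reconnection attempt number and the number of PPPoE timeouts. The function name `summarize_ppp_log` and the regular expressions are my own illustration, based only on the message formats quoted above; other ppp message variants are not handled.

```python
import re

# Patterns derived from the log lines quoted in this post (an assumption:
# other ppp/mpd message variants may exist and are not covered here).
ATTEMPT_RE = re.compile(r"reconnection attempt (\d+)")
TIMEOUT_RE = re.compile(r"PPPoE connection timeout")


def summarize_ppp_log(lines):
    """Return (highest reconnection attempt seen, number of PPPoE timeouts)."""
    highest = 0
    timeouts = 0
    for line in lines:
        match = ATTEMPT_RE.search(line)
        if match:
            highest = max(highest, int(match.group(1)))
        if TIMEOUT_RE.search(line):
            timeouts += 1
    return highest, timeouts


# A few lines copied from the log excerpt above.
sample = [
    "Jul 3 08:34:10 home ppp[13666]: [wan_link0] PPPoE connection timeout after 9 seconds",
    "Jul 3 08:34:10 home ppp[13666]: [wan_link0] Link: reconnection attempt 694 in 4 seconds",
    "Jul 3 08:34:14 home ppp[13666]: [wan_link0] Link: reconnection attempt 694",
    "Jul 3 08:34:23 home ppp[13666]: [wan_link0] PPPoE connection timeout after 9 seconds",
    "Jul 3 08:34:23 home ppp[13666]: [wan_link0] Link: reconnection attempt 695 in 1 seconds",
]

print(summarize_ppp_log(sample))  # (695, 2)
```

To run it against a real log, pass an open file object in place of `sample`. Where the log lives on the router is something you would need to check for your own install.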
-
GRRR... I can't post more information, the forum says my reply is spam??
"Flagged as spam by akismet"
-
And another piece of information... my router became unresponsive and I noticed that miniupnpd was using a lot of CPU.
miniupnpd had used 15 minutes of CPU time since I rebooted 9 hours ago, which seems excessive.
I've restarted the service and now it's using no CPU...
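For context, here is the arithmetic behind those figures as a small Python sketch. The helper name `average_cpu_share` is only for illustration. It gives the average over the whole uptime, so it cannot show whether the CPU time was spread evenly or burned in short bursts.

```python
def average_cpu_share(cpu_minutes, uptime_hours):
    """Average fraction of one CPU consumed over the given uptime."""
    return cpu_minutes / (uptime_hours * 60)


# Figures from the post: 15 minutes of CPU time over 9 hours of uptime.
share = average_cpu_share(15, 9)
print(f"{share:.1%}")  # 2.8%
```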
-
I have now caught miniupnpd using a lot of CPU time several times. When I go and look at the UPnP status (rules), there are none.
A restart returns it to normal.
-
OK, this is looking like
https://forum.netgate.com/topic/164178/upnp-broken-on-21-05/11
-
Unless you're seeing those errors in the routing log it may not be.
The logs you showed above are PPPoE failing to connect. Unrelated to UPnP but could cause miniupnpd to use far more CPU than usual.
Is the parent NIC linked? Does PPPoE succeed at all?
Steve
-
Yes, I saw the miniupnpd errors in my logs... but every time I try to paste them into this ticket, it is refused, claiming it's spam. Hence my GRRR comment up the chain :)
I've applied the fix from the upnp-broken ticket and will see how it goes.