Router hanging after 21.05 upgrade

sdm900

Morning

I upgraded my NetGate SG1000 to 21.05 last week, and since then it has been hanging and effectively crashing. I've had to power cycle it about 4 times.

It lasts about 48 hours, then I notice the web interface hangs and I can't ssh to it. BUT packets are still being routed and filtered.

About another 24 hours later, it hangs entirely and stops passing packets.

Any ideas?

Thanks.

The last few lines of system.log just prior to the last crash (this morning)

Jul  6 07:27:03 home nginx: 2021/07/06 07:27:03 [error] 89464#100108: *1520 upstream timed out (60: Operation timed out) while reading response header from upstream, client: 10.0.0.158, server: , request: "POST /widgets/widgets/interface_statistics.widget.php HTTP/2.0", upstream: "fastcgi://unix:/var/run/php-fpm.socket", host: "10.0.0.1", referrer: "https://10.0.0.1/index.php"
Jul  6 07:32:23 home nginx: 2021/07/06 07:32:23 [error] 89464#100108: *1522 upstream timed out (60: Operation timed out) while reading response header from upstream, client: 10.0.0.158, server: , request: "POST /widgets/widgets/dyn_dns_status.widget.php HTTP/2.0", upstream: "fastcgi://unix:/var/run/php-fpm.socket", host: "10.0.0.1", referrer: "https://10.0.0.1/index.php"
Jul  6 07:37:43 home nginx: 2021/07/06 07:37:43 [error] 89464#100108: *1524 upstream timed out (60: Operation timed out) while reading response header from upstream, client: 10.0.0.158, server: , request: "POST /getstats.php HTTP/2.0", upstream: "fastcgi://unix:/var/run/php-fpm.socket", host: "10.0.0.1", referrer: "https://10.0.0.1/index.php"
Jul  6 07:57:19 home nginx: 2021/07/06 07:57:19 [error] 89464#100108: *1526 upstream timed out (60: Operation timed out) while reading response header from upstream, client: 10.0.0.158, server: , request: "GET /index.php HTTP/2.0", upstream: "fastcgi://unix:/var/run/php-fpm.socket", host: "10.0.0.1", referrer: "https://10.0.0.1/status_interfaces.ph

I rebooted the router before it stopped passing packets.
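For anyone wanting to see which requests were timing out, here is a minimal Python sketch that pulls the request and upstream out of nginx lines like the ones above. The regex and the function name `timed_out_requests` are my own assumptions based only on this sample, not anything pfSense ships.

```python
import re

# Pattern is an assumption derived from the nginx error lines quoted above.
TIMEOUT_RE = re.compile(
    r'upstream timed out .*? request: "(?P<method>\w+) (?P<path>\S+) [^"]*", '
    r'upstream: "(?P<upstream>[^"]+)"'
)

def timed_out_requests(lines):
    """Return (method, path, upstream) for each nginx upstream-timeout line."""
    hits = []
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            hits.append((m.group("method"), m.group("path"), m.group("upstream")))
    return hits

# Two of the lines from the log excerpt above, copied verbatim.
sample = [
    'Jul  6 07:27:03 home nginx: 2021/07/06 07:27:03 [error] 89464#100108: *1520 upstream timed out (60: Operation timed out) while reading response header from upstream, client: 10.0.0.158, server: , request: "POST /widgets/widgets/interface_statistics.widget.php HTTP/2.0", upstream: "fastcgi://unix:/var/run/php-fpm.socket", host: "10.0.0.1", referrer: "https://10.0.0.1/index.php"',
    'Jul  6 07:37:43 home nginx: 2021/07/06 07:37:43 [error] 89464#100108: *1524 upstream timed out (60: Operation timed out) while reading response header from upstream, client: 10.0.0.158, server: , request: "POST /getstats.php HTTP/2.0", upstream: "fastcgi://unix:/var/run/php-fpm.socket", host: "10.0.0.1", referrer: "https://10.0.0.1/index.php"',
]

for method, path, upstream in timed_out_requests(sample):
    print(method, path, "->", upstream)
```

In this excerpt every timeout goes to the same php-fpm socket regardless of the page requested, which may point at PHP-FPM as a whole being stuck rather than at one particular widget.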

The previous hang where it stopped passing packets and needed to be rebooted

Jul  3 08:33:48 home ppp[13666]: [wan_link0] Link: reconnection attempt 692
Jul  3 08:33:48 home ppp[13666]: [wan_link0] PPPoE: Connecting to 'Tangerine'
Jul  3 08:33:57 home ppp[13666]: [wan_link0] PPPoE connection timeout after 9 seconds
Jul  3 08:33:57 home ppp[13666]: [wan_link0] Link: DOWN event
Jul  3 08:33:57 home ppp[13666]: [wan_link0] LCP: Down event
Jul  3 08:33:57 home ppp[13666]: [wan_link0] Link: reconnection attempt 693 in 4 seconds
Jul  3 08:34:01 home ppp[13666]: [wan_link0] Link: reconnection attempt 693
Jul  3 08:34:01 home ppp[13666]: [wan_link0] PPPoE: Connecting to 'Tangerine'
Jul  3 08:34:10 home ppp[13666]: [wan_link0] PPPoE connection timeout after 9 seconds
Jul  3 08:34:10 home ppp[13666]: [wan_link0] Link: DOWN event
Jul  3 08:34:10 home ppp[13666]: [wan_link0] LCP: Down event
Jul  3 08:34:10 home ppp[13666]: [wan_link0] Link: reconnection attempt 694 in 4 seconds
Jul  3 08:34:14 home ppp[13666]: [wan_link0] Link: reconnection attempt 694
Jul  3 08:34:14 home ppp[13666]: [wan_link0] PPPoE: Connecting to 'Tangerine'
Jul  3 08:34:23 home ppp[13666]: [wan_link0] PPPoE connection timeout after 9 seconds
Jul  3 08:34:23 home ppp[13666]: [wan_link0] Link: DOWN event
Jul  3 08:34:23 home ppp[13666]: [wan_link0] LCP: Down event
Jul  3 08:34:23 home ppp[13666]: [wan_link0] Link: reconnection attempt 695 in 1 seconds
Jul  3 08:34:24 home ppp[13666]: [wan_link0] Link: reconnection attempt 695
Jul  3 08:34:24 home ppp[13666]: [wan_link0] PPPoE: Connecting to 'Tangerine'

I have THOUSANDS of these messages.
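Rather than scrolling through them, a rough Python sketch like the one below can summarise the ppp lines. The regexes and the helper name `summarize_ppp` are assumptions based only on the lines pasted above.

```python
import re

# Both patterns are assumptions based on the ppp log lines quoted above.
ATTEMPT_RE = re.compile(r"Link: reconnection attempt (\d+)")
TIMEOUT_RE = re.compile(r"PPPoE connection timeout after (\d+) seconds")

def summarize_ppp(lines):
    """Return (highest reconnection attempt number, count of PPPoE timeouts)."""
    highest = 0
    timeouts = 0
    for line in lines:
        m = ATTEMPT_RE.search(line)
        if m:
            highest = max(highest, int(m.group(1)))
        elif TIMEOUT_RE.search(line):
            timeouts += 1
    return highest, timeouts

# A few lines copied from the excerpt above.
sample = [
    "Jul  3 08:33:48 home ppp[13666]: [wan_link0] Link: reconnection attempt 692",
    "Jul  3 08:33:48 home ppp[13666]: [wan_link0] PPPoE: Connecting to 'Tangerine'",
    "Jul  3 08:33:57 home ppp[13666]: [wan_link0] PPPoE connection timeout after 9 seconds",
    "Jul  3 08:33:57 home ppp[13666]: [wan_link0] Link: reconnection attempt 693 in 4 seconds",
]

print(summarize_ppp(sample))  # (693, 1)
```

To run it over a whole log, pass an open file object instead of `sample`; the path to the log on a given box is not something this sketch assumes.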

sdm900

GRRR... I can't post more information, the forum says my reply is spam??

"Flagged as spam by akismet"

sdm900

And in another piece of information... my routing became unresponsive and I noticed that miniupnpd was using a lot of CPU.

miniupnpd had used 15 minutes of CPU time since I rebooted 9 hours ago, which seems excessive.
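To put that figure in perspective, here is the back-of-the-envelope arithmetic as a small Python sketch. The function name `cpu_share` is made up for illustration, and it assumes the CPU time is simply averaged over the whole uptime on one core.

```python
def cpu_share(cpu_minutes, uptime_hours):
    """Average fraction of one core used, given CPU minutes and uptime in hours."""
    return cpu_minutes / (uptime_hours * 60)

share = cpu_share(15, 9)   # 15 minutes of CPU over 9 hours of uptime
print(f"{share:.1%}")      # 2.8%
```

An average of a few percent hides how the time was spent, so the usage may well have come in short bursts at much higher load rather than as a steady trickle.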

I've restarted the service and now it's using no CPU...

sdm900

I have now caught miniupnpd using a lot of CPU time several times. When I go and look at the UPnP status (rules), there are none.

A restart returns it to normal.

sdm900

OK, this is looking like

https://forum.netgate.com/topic/164178/upnp-broken-on-21-05/11

stephenw10

Unless you're seeing those errors in the routing log it may not be.

The logs you showed above are PPPoE failing to connect. Unrelated to UPnP but could cause miniupnpd to use far more CPU than usual.

Is the parent NIC linked? Does PPPoE succeed at all?

Steve

sdm900

Yes, I saw the miniupnpd errors in my logs... but every time I try to paste them into this ticket, it is refused, claiming it's spam. Hence my GRRR comment up the chain :)

I've applied the fix from the upnp-broken ticket and will see how it goes.