Check_Reload_Status 100% CPU Again Again
-
On Jan 25 I reconfigured my dual-WAN network, moved some switches, and cold-booted my SG-1100 (running 23.09.1) to move it to a different UPS. I did not change the pfSense configuration. All this work was done before the evening.
About 8 PM on the 25th, check_reload_status began consuming 100% CPU and didn't stop until I noticed it last night. I could not kill the process; I had to restart php-fpm to get it to stop.
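In case it helps anyone hitting the same thing, this is roughly what I did from a shell. The restart script path is what I used on my 23.09.1 box; treat it as an assumption and fall back to the console menu's "Restart PHP-FPM" option if it isn't there:

# spot the runaway process and its PID (FreeBSD top: -a full command, -S system procs, -H threads)
top -aSH
ps auxww | grep '[c]heck_reload_status'

# a plain kill did nothing for me; restarting PHP-FPM is what finally cleared it
/etc/rc.php-fpm_restart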
My 1100s don't have a history of doing this. I found posts on this problem going back to 2018 and earlier, and apparently it still exists. I don't know how to look for a root cause, but I will investigate if someone can point me at where to look.
Thanks!
-
Have you only seen this one time? If not, is it repeatable?
-
@stephenw10 …unfortunately yes to both. I have no idea how to cause the failure.
Here are the more specific steps that led up to the event:
The change the router saw on the 25th was that the Starlink dish went from being connected to the router through a Netgear switch (which carried the Starlink connection from another part of the house over a dedicated VLAN) to being directly connected, after I moved the dish to a spot where it could reach the router.
To do this while keeping an internet connection, I stowed the dish and then disabled the dish interface on the 1100. Disabling the interface was technically unnecessary, but I wanted a clean path to the T-Mobile gateway while I moved and reconnected the dish, and I didn't want the dish talking to the router after I reconnected it until I was sure I was ready.
When I had reconnected the dish and was ready, I enabled the interface and pfSense reestablished load balancing across the dish and the existing T-Mobile service. All good, until 8 PM, when CPU consumption jumped by 50 percentage points; I only noticed it yesterday.
-
Hmm, so it could be seeing the link go down on WAN now, when previously it would not have because of the switch. Check the system logs to see if they show link events.
If that is somehow causing an issue, you can set the WAN not to reflect the port status so it will always appear up.
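From a shell, something like this should surface any link events; it assumes the plain-text logs used on recent versions, and you'd look for whatever interface names your WANs use:

# physical link changes plus the hotplug handling they kick off
grep -iE 'link state changed|rc.linkup|check_reload_status' /var/log/system.log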
-
Interesting idea, but I don't think so.
pfSense had no problem detecting when the Starlink link was down with the switch in place, since the link detection relied on pings. I had plenty of down events in the logs for Starlink since it tended to drop out a few nights a week in the early hours of the morning, especially when rebooting after a firmware update from the Starlink alien overlords.
T-Mobile Home Internet also drops out for a minute or two from time to time though not as often as Starlink. pfSense detects that fine and it's always been directly connected.
Starlink has become more reliable over the last year or so, presumably as more satellites entered service, but pfSense never had trouble detecting its down or up status with the Starlink connection passing through the switch. Note the Starlink VLAN served only the dish and the router port. No other traffic was on that VLAN.
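For what it's worth, the ping-based events come from dpinger rather than from a physical link change, and the two are easy to tell apart in the logs; this is how I've been separating them, assuming the usual log locations:

# gateway-monitor (ping) alarms
grep dpinger /var/log/gateways.log

# physical link changes, the kind that fire rc.linkup
grep 'link state changed' /var/log/system.log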
Probably this is unneeded background, but just in case: Starlink was previously at the other end of the house, where it was much easier to install. But the room got very hot in the summer, which made me concerned about the longevity of the SL brick in the heat.
I recently developed a plan to move it somewhere cooler but harder to install, where it would join the SG-1100, a switch, and the T-Mobile Arcadyan gateway.
Another variable: over the last couple of years I have disabled the WAN interfaces from time to time for various other experiments (not related to moving the WANs, as in this case), and check_reload_status did not spin after I re-enabled those interfaces.
So it doesn't seem to be related to disabling and re-enabling the WAN interfaces alone.
-
There is a significant difference, though, between the gateway going down (so monitoring shows a line drop) and the WAN NIC actually losing link.
So, for example, do you see logs like this:
Jan 26 16:05:07 check_reload_status 533 Linkup starting $e6000sw0port2
Jan 26 16:05:07 kernel e6000sw0port2: link state changed to DOWN
Jan 26 16:05:08 php-fpm 486 /rc.linkup: Hotplug event detected for LAN(lan) dynamic IP address (4: 192.168.201.1, 6: track6)
Jan 26 16:05:08 php-fpm 486 /rc.linkup: DEVD Ethernet detached event for lan
That's obviously for the LAN on my 1100 here but you can see it triggers check_reload_status.
-
@stephenw10 said in Check_Reload_Status 100% CPU Again Again:
There is a significant difference though between the gateway going down so monitoring shows a line drop and the WAN NIC actually losing link.
So, for example do you see logs like this:
Jan 26 16:05:07 check_reload_status 533 Linkup starting $e6000sw0port2
Jan 26 16:05:07 kernel e6000sw0port2: link state changed to DOWN
Jan 26 16:05:08 php-fpm 486 /rc.linkup: Hotplug event detected for LAN(lan) dynamic IP address (4: 192.168.201.1, 6: track6)
Jan 26 16:05:08 php-fpm 486 /rc.linkup: DEVD Ethernet detached event for lan
That's obviously for the LAN on my 1100 here but you can see it triggers check_reload_status.
I'd misunderstood.
Prior to Jan 25, there are loads of "detached" events for T-Mobile but none in recent history for Starlink. T-Mobile, you may recall, is directly connected to the SG-1100 and has been for nearly a year. I don't unplug it, so I don't know why it's showing up like this. A bad cable seems more likely than the NIC in the SG-1100 going bad, but either is possible. That NIC didn't have a history of going bad when it was serving a DSL connection instead.
(That said, T-Mobile has a history of freaking out, where Status > Monitoring > Quality shows latency oscillating wildly for days and then settling back down to steadily low values for weeks. The cable was a new Cat 6 drop cable when it was installed several months ago. I can swap it for another cable to test this theory. I'd been working a theory that it was the cell network, not the cable... but maybe it's the cable, or the NIC. Is there a built-in diagnostic I can run to rule out the NIC and/or the cable? Rebooting the T-Mobile gateway has resolved the issue temporarily during past episodes.)
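In the meantime I plan to watch the FreeBSD error counters, which I understand is the closest thing to a built-in check; the interface name below is my guess for the SG-1100's uplink, so substitute whatever Interfaces > Assignments shows:

# per-interface packet and error counters; climbing Ierrs/Oerrs usually point at a bad cable or port
netstat -i
netstat -I mvneta0

# negotiated media and link status for the port
ifconfig mvneta0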
There aren't any "detached" events for Starlink going back months, until Jan 25 when I reconfigured the network. T-Mobile actually behaved on Jan 25, and I did not unplug it during the work.
-
I'd guess that's the T-Mobile router rebooting. Since the Starlink WAN does not lose link, it's probably not that one. However, it could be the T-Mobile device losing link that is triggering this. You might try setting that interface not to reflect the port state as a test.
-
I'm a bit skeptical that the T-Mobile gateway is rebooting: before an experiment where I rebooted it to see whether that restored the latency stability, it had run continuously for well over three months without rebooting, according to its uptime clock. Soft-rebooting it did get the latency back under control for a while.
I do have some experiments to run on the T-Mobile side, which I will do.
However, I think the original unspoken point still holds: check_reload_status shouldn't get stuck at 100% CPU for days, regardless of the circumstances.
Even if an interface is flapping, this process should be able to handle it. It doesn't seem to be trapping some condition well.
The probability is low, in that in more than two years with the SG-1100s, countless interface ups and downs, and many experiments, this is the first time it has happened.
However, the consequences are high: it consumes an entire core for no added value and will do so for days on end. Other than a reboot or a restart of PHP, I don't know whether it would ever have stopped on its own.
Other threads describe this issue, or similar ones, bringing the router to its knees or taking it down entirely.
Thanks for helping me think through troubleshooting the T-Mobile interface. I'll look into it.
-
I agree, if it shows that uptime it's not rebooting. Odd, then, that it's somehow losing link.
I also agree that check_reload_status should not get stuck like that. As you found, we have had issues with it in the past, and they are difficult to pin down because the problem is normally not repeatable on demand. If we can narrow it down to something like a link state change, that would be very helpful.
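If it spins again, a snapshot of what the process is actually doing before anything gets restarted would help a lot; a minimal sketch, assuming the stock FreeBSD tools are present on the 1100:

# grab the PID of the newest check_reload_status
PID=$(pgrep -n check_reload_status)

# kernel stack of each thread, i.e. where it is spinning
procstat -kk $PID

# first few syscalls it is looping over, if any
truss -p $PID 2>&1 | head -n 50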