100% /usr/local/sbin/check_reload_status after gateway down
-
Every time?
How are the interfaces configured/connected?
-
Hello!
Around 30% of my routers running 23.05.1 had check_reload_status stuck at 100%. This was not a huge problem for the faster boxes, but the slower ones (sg-3100) were inaccessible from the gui (gateway timeout). Could not kill the check_reload_status process. They needed a reboot from the shell.
I have no idea how they got in this state. Some are multi-wan. DHCP and static IP WANs. All of them have the wan occasionally go down and come back. Pretty basic setups. Nothing odd.
check_reload_status seems to have a history...
https://forum.netgate.com/topic/112573/what-is-check_reload_status
https://redmine.pfsense.org/issues/2555
John
-
@stephenw10 Not every time. But every 2-3th time. WAN1 and WAN2 are connected physically via PPPoE and WAN3 (the 5G test WAN) is connected via VLAN and DHCP. I will try to use another physical port for that interface to see if this issue still exists. But I think the gateway handling with 23.05.1 is the problem. Since that release I also do have the issue that the gateway for my Wireguard connection is disabled after bootup.
-
A regression since 23.05?
There is history with check_reload_status, yes, but no recent bugs for it.
Steve
-
@stephenw10 said in 100% /usr/local/sbin/check_reload_status after gateway down:
A regression since 23.05?
I think it is, but unfortunately I can't diagnose it further.
I sold a SG-3100 to a friend that lives in another state, and he is facing the same problem.
PPPoE also.. -
And he definitely wasn't seeing it in 23.05?
-
@stephenw10 said in 100% /usr/local/sbin/check_reload_status after gateway down:
And he definitely wasn't seeing it in 23.05?
Unfortunately I'm not sure.. He is not the type of the guy that monitors CPU usage and other metrics.
He called me to help him in something I found that problem and he was already running 23.05.1All I can say is that he is using 23.05.1, multi WAN configuration, one link is PPPoE and the other is not.
-
Mmm, hard to see what might have caused that between 23.05 and 23.05.1. Far more likely to have been introduced since 23.01. Though still nothing obviously that might have caused it.
-
@stephenw10 said in 100% /usr/local/sbin/check_reload_status after gateway down:
Though still nothing obviously that might have caused it.
It happened three times already, sorry I can't help further..
I asked him to buy a modem and let the modem handle the PPPoE for the time being, he will use the DMZ in the modem..
He is not using uPNP, IPv6 track interface, Voip or anything like that, so I think the DMZ option will work fine for him..If there is something that I can check during the next event, just tell me and I'll check with him next time.
-
Hello!
I traced the point of failure on two routers back to 8/3 using Status -> Monitoring.
They are both in the same comcast service area that was having persistent outages for about 2 hours - WANs going up and down.
One router is single wan dhcp. The other is dual wan comcast static ip and dsl pppoe.
The single wan is a sg-3100 and the dual wan is a fw4b. Both on 23.05.1
Both routers seemed to be running fine for almost 30days with check_reload_status stuck at 100%. I only noticed it on the 3100 when I tried to login.
Even though check_reload_status was at 100%, the system load was significantly lower. System/nice util at 13% on the fw4b and system 35% / nice 15% on the 3100.
I probably never would have noticed it if I didnt need to login to the 3100.
What (other) problems can check_reload_status cause when it is stuck?
John
-
Well it uses CPU cycles so you would see reduced throughput etc if you were already close to the hardware limits.
Let me see what I can fins here.
-
@stephenw10 Right now it happened again. I figured out that after reconnecting the 5G modem the WebGUI gets very unresponsive for couple of minutes. Also another gateway (CYBERGHOST) is losing it's connection (seems to be restarted). Took 5 minutes until the 5G got it's public IPv4 via passthrough mode. After that check_reload_status is at 100% again.
This is the log while that happened:
Aug 31 17:01:26 vnstatd 28204 Error: pidfile "/var/run/vnstat/vnstat.pid" lock failed (Resource temporarily unavailable), exiting. Aug 31 17:01:26 kernel igc1.300: promiscuous mode disabled Aug 31 17:01:26 php-fpm 80144 /rc.dyndns.update: phpDynDNS (xxx.xxx.net): No change in my IP address and/or 25 days has not passed. Not updating dynamic DNS entry. Aug 31 17:01:25 check_reload_status 63049 rc.newwanip starting ovpnc4 Aug 31 17:01:25 kernel ovpnc4: link state changed to UP Aug 31 17:01:25 php-fpm 41840 /rc.newwanip: rc.newwanip: on (IP address: 10.112.1.1) (interface: OPENVPN_SRV[opt11]) (real interface: ovpns3). Aug 31 17:01:25 php-fpm 41840 /rc.newwanip: rc.newwanip: Info: starting on ovpns3. Aug 31 17:01:25 php-fpm 41840 OpenVPN PID written: 59390 Aug 31 17:01:24 check_reload_status 63049 Reloading filter Aug 31 17:01:24 kernel ovpnc4: link state changed to DOWN Aug 31 17:01:24 php-fpm 41840 OpenVPN terminate old pid: 7849 Aug 31 17:01:23 php-fpm 41840 /rc.openvpn: OpenVPN: Resync client4 CYBERGHOST Aug 31 17:01:23 check_reload_status 63049 rc.newwanip starting ovpns3 Aug 31 17:01:23 kernel ovpns3: link state changed to UP Aug 31 17:01:23 php-fpm 41840 OpenVPN PID written: 49415 Aug 31 17:01:23 php-fpm 80144 /rc.dyndns.update: phpDynDNS (xxx.xxx.net): No change in my IP address and/or 25 days has not passed. Not updating dynamic DNS entry. Aug 31 17:01:23 check_reload_status 63049 Reloading filter Aug 31 17:01:23 kernel ovpns3: link state changed to DOWN Aug 31 17:01:23 php-fpm 41840 OpenVPN terminate old pid: 53852 Aug 31 17:01:23 php-fpm 81311 /rc.dyndns.update: phpDynDNS (xxx.xxx.net): No change in my IP address and/or 25 days has not passed. Not updating dynamic DNS entry. Aug 31 17:01:22 php-fpm 41840 /rc.openvpn: OpenVPN: Resync server3 OpenVPN Server Aug 31 17:01:21 php-fpm 41840 /rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed IP addresses. Reloading endpoints that may use WAN_5G. Aug 31 17:01:21 check_reload_status 63049 rc.newwanip starting ovpnc4 Aug 31 17:01:21 kernel ovpnc4: link state changed to UP Aug 31 17:01:21 snmpd 68386 disk_OS_get_disks: adding device 'nvd0' to device list Aug 31 17:01:21 snmpd 68386 disk_OS_get_disks: adding device 'mmcsd0boot0' to device list Aug 31 17:01:21 snmpd 68386 disk_OS_get_disks: adding device 'mmcsd0boot1' to device list Aug 31 17:01:21 check_reload_status 63049 Updating static routes based on hostnames Aug 31 17:01:21 check_reload_status 63049 Reloading filter Aug 31 17:01:21 php-fpm 80144 OpenVPN PID written: 7849 Aug 31 17:01:20 php-fpm 60232 /rc.start_packages: Restarting/Starting all packages. Aug 31 17:01:20 php-fpm 41840 /rc.newroutedns: Static Routes: One or more aliases used for routing has changed its IP. Refreshing. Aug 31 17:01:20 check_reload_status 63049 Reloading filter Aug 31 17:01:20 check_reload_status 63049 Starting packages Aug 31 17:01:20 php-fpm 60232 /rc.newwanip: Netgate pfSense Plus package system has detected an IP change or dynamic WAN reconnection - 10.2.5.70 -> 10.4.5.5 - Restarting packages. Aug 31 17:01:20 kernel ovpnc4: link state changed to DOWN Aug 31 17:01:20 php-fpm 80144 OpenVPN terminate old pid: 5686 Aug 31 17:01:19 php-fpm 80144 /rc.openvpn: OpenVPN: Resync client4 CYBERGHOST Aug 31 17:01:19 check_reload_status 63049 rc.newwanip starting ovpns3 Aug 31 17:01:19 php-fpm 80144 OpenVPN PID written: 53852 Aug 31 17:01:19 kernel ovpns3: link state changed to UP Aug 31 17:01:19 php-fpm 81311 /rc.dyndns.update: phpDynDNS (rv1125g.homeip.net): No change in my IP address and/or 25 days has not passed. Not updating dynamic DNS entry. Aug 31 17:01:18 kernel ovpns3: link state changed to DOWN Aug 31 17:01:18 devd 907 notify_clients: send() failed; dropping unresponsive client Aug 31 17:01:18 php-fpm 80144 OpenVPN terminate old pid: 78480 Aug 31 17:01:18 php-fpm 60232 /rc.newwanip: Creating rrd update script Aug 31 17:01:18 php-fpm 60232 /rc.newwanip: Ignoring IPsec reload since there are no tunnels on interface opt10 Aug 31 17:01:17 php-fpm 80144 /rc.openvpn: OpenVPN: Resync server3 OpenVPN Server Aug 31 17:01:17 php-fpm 60232 /rc.newwanip: IP Address has changed, killing states on former IP Address 10.2.5.70. Aug 31 17:01:17 php-fpm 80144 /rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed IP addresses. Reloading endpoints that may use WAN_5G. Aug 31 17:01:17 check_reload_status 63049 Restarting OpenVPN tunnels/interfaces Aug 31 17:01:17 check_reload_status 63049 Restarting IPsec tunnels Aug 31 17:01:17 check_reload_status 63049 updating dyndns WAN_5G Aug 31 17:01:17 rc.gateway_alarm 74674 >>> Gateway alarm: WAN_5G (Addr:4.2.2.2 Alarm:1 RTT:0ms RTTsd:0ms Loss:100%) Aug 31 17:01:16 php_pfb 86922 [pfBlockerNG] filterlog daemon started Aug 31 17:01:15 tail_pfb 86542 [pfBlockerNG] Firewall Filter Service started Aug 31 17:01:15 php_pfb 82278 [pfBlockerNG] filterlog daemon stopped Aug 31 17:01:15 tail_pfb 82104 [pfBlockerNG] Firewall Filter Service stopped Aug 31 17:01:15 vnstatd 76188 Error: pidfile "/var/run/vnstat/vnstat.pid" lock failed (Resource temporarily unavailable), exiting. Aug 31 17:01:15 snmpd 56099 disk_OS_get_disks: adding device 'nvd0' to device list Aug 31 17:01:15 snmpd 56099 disk_OS_get_disks: adding device 'mmcsd0boot0' to device list Aug 31 17:01:15 snmpd 56099 disk_OS_get_disks: adding device 'mmcsd0boot1' to device list Aug 31 17:01:15 php-fpm 81311 /rc.newroutedns: Static Routes: One or more aliases used for routing has changed its IP. Refreshing. Aug 31 17:01:15 php-fpm 81311 /rc.start_packages: Skipping STARTing packages process because previous/another instance is already running Aug 31 17:01:15 check_reload_status 63049 Reloading filter Aug 31 17:01:15 check_reload_status 63049 Starting packages
-
Hmm, it actually took 5mins? Because the 5G router didn't have an IP for that long?
That is at least unusual. Obviously that shouldn't make any difference but....
-
@stephenw10 5G modem took around 1 minute to bootup and getting an IP address.
-
And pfSense took a further 4 mins to pull a dhcp lease?
What was logged in that time? In the dhcp and/or system log?
-
@stephenw10 Unfortunately did not check that. I did now use a physical port for that connection and will see if the error still appears. Otherwise it's an issue with VLAN WAN interfaces as it seems.
-
I've yet to see this here on any system. If anyone has a way to replicate this please let us know so we can dig into it.
-
I have check_reload_status at 100% too. It must be related to OpenVPN for me when the interface is removed from my gateway routing group.
-
Only removing it from the group prevents it? The interface is still assigned?
-
I'm seeing "check_reload_status" persistently using 50-99% of CPU for over a month now.
Netgate 3100,
23.05-RELEASE (arm)
built on Mon May 22 15:04:22 UTC 2023
FreeBSD 14.0-CURRENTlast pid: 12641; load averages: 1.26, 1.25, 1.19 up 93+21:02:37 12:16:05 73 processes: 2 running, 63 sleeping, 8 zombie CPU: 2.5% user, 14.9% nice, 36.7% system, 0.2% interrupt, 45.7% idle Mem: 88M Active, 1413M Inact, 255M Laundry, 168M Wired, 58M Buf, 74M Free Displays to show (currently infinite): PID USERNAME THR PRI NICE SIZE RES STATE C TIME WCPU COMMAND 439 root 1 155 20 5100K 1960K CPU1 1 782.7H 98.29% check_reload_status 73655 root 2 20 0 12M 5736K select 0 489:13 4.22% openvpn 61330 squid 1 20 0 558M 525M kqread 0 599:50 0.40% squid 37787 root 1 20 0 6704K 2904K CPU0 0 0:02 0.14% top 12641 root 1 29 0 4340K 1668K nanslp 0 0:00 0.13% sleep 62192 root 1 29 0 4988K 1924K wait 1 0:40 0.06% sh 24342 root 1 20 0 5480K 2332K bpf 1 71:28 0.05% filterlog 64048 root 1 20 0 5264K 2128K select 0 56:54 0.03% syslogd 54611 nobody 1 20 0 8528K 3460K select 1 86:17 0.03% dnsmasq 29638 root 1 20 0 14M 4936K select 0 26:23 0.02% mpd5 6186 root 1 20 0 19M 12M bpf 0 8:18 0.02% bandwidthd 6452 root 1 20 0 12M 5664K bpf 0 8:34 0.02% bandwidthd 6236 root 1 20 0 21M 15M bpf 0 8:09 0.02% bandwidthd 3705 root 1 20 0 16M 10M bpf 0 7:56 0.02% bandwidthd 3659 root 1 20 0 19M 12M bpf 0 8:01 0.02% bandwidthd 3703 root 1 20 0 21M 15M bpf 1 7:51 0.02% bandwidthd 6281 root 1 20 0 16M 10M bpf 0 8:13 0.01% bandwidthd 60680 adamw 1 20 0 12M 7112K select 0 0:00 0.01% sshd 3837 root 1 20 0 12M 5664K bpf 0 8:15 0.01% bandwidthd 87030 root 1 20 0 14M 10M select 0 10:52 0.01% bsnmpd 24506 root 5 68 0 7700K 2200K uwait 1 2:52 0.01% dpinger 13204 squid 1 20 0 10M 5368K select 1 0:05 0.01% pinger 24832 root 5 68 0 5508K 2156K uwait 0 3:16 0.01% dpinger 61947 root 1 20 0 11M 3968K select 1 8:23 0.01% ntpd 398 root 1 20 0 92M 17M kqread 0 4:05 0.00% php-fpm 1904 root 1 20 0 10M 5304K kqread 0 1:00 0.00% lighttpd_ls 78947 root 1 20 0 5544K 2192K bpf 0 2:39 0.00% choparp 16566 root 2 20 0 12M 5332K select 1 51:55 0.00% openvpn 66894 root 1 68 20 5184K 1996K wait 1 12:57 0.00% sh 61385 squid 1 20 0 7128K 2580K piperd 1 3:46 0.00% unlinkd 3420 root 1 20 0 5032K 1848K nanslp 0 1:39 0.00% cron 79339 root 1 20 0 12M 4736K select 0 0:51 0.00% sshd 15682 root 1 68 0 128M 34M accept 1 0:45 0.00% php-fpm 23 root 1 20 0 127M 35M accept 1 0:4498124 senphp-fpm 67340 root 1 68 0 127M 35M accept 1 0:39 0.00% php-fpm 32610 root 1 36 0 129M 35M accept 1 0:29 0.00% php-fpm 17232 root 1 68 0 126M 33M accept 1 0:27 0.00% php-fpm 52591 root 3 20 0 9960K 2388K usem 1 0:18 0.00% filterdns 69767 root 1 20 0 21M 6548K kqread 1 0:11 0.00% nginx 64797 root 1 20 0 4480K 1528K nanslp 0 0:04 0.00% minicron 763 root 1 20 0 6296K 2368K select 1 0:04 0.00% devd 66615 root 1 31 0 4476K 1524K nanslp 1 0:03 0.00% minicron 70082 root 1 20 0 22M 6724K kqread 1 0:02 0.00% nginx 66007 root 1 20 0 4504K 1532K nanslp 0 0:00 0.00% minicron 60088 root 1 20 0 12M 6848K select 0 0:00 0.00% sshd 34483 adamw 1 20 0 10M 6388K select 0 0:00 0.00% sudo 60799 adamw 1 20 0 6704K 3320K pause 1 0:00 0.00% tcsh 72912 root 2 20 0 5448K 2160K piperd 0 0:00 0.00% sshg-blocker 37069 root 1 20 0 5792K 2792K wait 1 0:00 0.00% sh 72456 root 1 26 0 5196K 2084K wait 1 0:00 0.00% sh 61087 squid 1 20 0 41M 8444K wait 0 0:00 0.00% squid 66474 root 1 20 0 4480K 1528K nanslp 0 0:00 0.00% minicron 36993 root 1 21 0 5148K 2332K wait 0 0:00 0.00% su 72856 root 1 20 0 11M 4148K piperd 0 0:00 0.00% sshg-parser 23365 root 1 68 0 4536K 1564K ttyin 0 0:00 0.00% getty 72826 root 1 20 0 4504K 1772K piperd 1 0:00 0.00% cat 73384 root 1 34 0 5148K 2068K piperd 1 0:00 0.00% sh 8116 root 1 68 20 4340K 1668K nanslp 1 0:00 0.00% sleep 69622 root 1 31 0 20M 4952K pause 1 0:00 0.00% nginx 65086 root 1 68 0 4456K 1520K wait 0 0:00 0.00% minicron 64593 root 1 68 0 4456K 1520K wait 1 0:00 0.00% minicron 66112 root 1 68 0 4456K 1520K wait 1 0:00 0.00% minicron 441 root 1 68 20 5048K 1860K kqread 0 0:00 0.00% check_reload_status 65747 root 1 68 0 4456K 1520K wait 1 0:00 0.00% minicron 73096 root 1 26 0 5196K 2080K wait 0 0:00 0.00% sh
Please advise if you would like to see anything else before I reboot the firewall tomorrow.