Netgate 1100 becoming unresponsive intermittently
-
I have a client with a brand-new Netgate 1100. She reports that it becomes unresponsive intermittently a couple of times a week. When that happens she neither has internet access nor can she log into the WebConfigurator interface. Because in those instances she has just power-cycled the unit, we only recently discovered that the device eventually recovers and starts operating normally again.
What could be the cause of this malfunction?
Here are the system logs from today:
Apr 8 04:18:03 rc.gateway_alarm 39093 >>> Gateway alarm: WAN_DHCP (Addr:100.93.160.1 Alarm:1 RTT:2.930ms RTTsd:1.385ms Loss:21%) Apr 8 04:18:03 check_reload_status 463 updating dyndns WAN_DHCP Apr 8 04:18:03 check_reload_status 463 Restarting IPsec tunnels Apr 8 04:18:03 check_reload_status 463 Restarting OpenVPN tunnels/interfaces Apr 8 04:18:03 check_reload_status 463 Reloading filter Apr 8 04:18:05 php-fpm 740 /rc.openvpn: Gateway, none 'available' for inet, use the first one configured. 'WAN_DHCP' Apr 8 04:18:05 php-fpm 740 /rc.openvpn: Gateway, NONE AVAILABLE Apr 8 04:18:05 php-fpm 740 /rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed IP addresses. Reloading endpoints that may use WAN_DHCP. Apr 8 05:06:00 sshguard 68389 Exiting on signal. Apr 8 05:06:00 sshguard 78042 Now monitoring attacks. Apr 8 06:02:00 sshguard 78042 Exiting on signal. Apr 8 06:02:00 sshguard 54186 Now monitoring attacks. Apr 8 06:36:00 sshguard 54186 Exiting on signal. Apr 8 06:36:00 sshguard 64978 Now monitoring attacks. Apr 8 06:58:00 sshguard 64978 Exiting on signal. Apr 8 06:58:00 sshguard 20813 Now monitoring attacks. Apr 8 07:54:00 sshguard 20813 Exiting on signal. Apr 8 07:54:00 sshguard 13378 Now monitoring attacks. Apr 8 07:55:00 sshguard 13378 Exiting on signal. Apr 8 07:55:00 sshguard 37932 Now monitoring attacks. Apr 8 08:50:00 sshguard 37932 Exiting on signal. Apr 8 08:50:00 sshguard 96590 Now monitoring attacks. Apr 8 09:05:09 check_reload_status 463 rc.newwanip starting mvneta0.4090 Apr 8 09:05:10 php-fpm 415 /rc.newwanip: rc.newwanip: Info: starting on mvneta0.4090. Apr 8 09:05:10 php-fpm 415 /rc.newwanip: rc.newwanip: on (IP address: 100.93.183.31) (interface: WAN[wan]) (real interface: mvneta0.4090). Apr 8 09:05:10 check_reload_status 463 Reloading filter Apr 8 09:05:28 php-fpm 740 /index.php: webConfigurator authentication error for user 'admin ' from: 192.168.42.122 Apr 8 09:06:06 rc.gateway_alarm 13935 >>> Gateway alarm: WAN_DHCP (Addr:100.93.160.1 Alarm:0 RTT:3.029ms RTTsd:1.577ms Loss:6%) Apr 8 09:06:06 check_reload_status 463 updating dyndns WAN_DHCP Apr 8 09:06:06 check_reload_status 463 Restarting IPsec tunnels Apr 8 09:06:06 check_reload_status 463 Restarting OpenVPN tunnels/interfaces Apr 8 09:06:06 check_reload_status 463 Reloading filter Apr 8 09:06:08 php-fpm 415 /rc.openvpn: Gateway, NONE AVAILABLE Apr 8 09:06:08 php-fpm 415 /rc.openvpn: Gateway, NONE AVAILABLE Apr 8 09:06:08 php-fpm 415 /rc.openvpn: OpenVPN: One or more OpenVPN tunnel endpoints may have changed IP addresses. Reloading endpoints that may use WAN_DHCP. Apr 8 10:08:21 php-fpm 416 /index.php: Session timed out for user 'admin' from: 192.168.1.33 (Local Database) Apr 8 10:08:40 php-fpm 416 /index.php: Successful login for user 'admin' from: 192.168.1.33 (Local Database) Apr 8 10:08:41 php-fpm 416 /index.php: The command '/usr/local/sbin/ping-auth -s > /etc/thoth/thothid 2>/dev/null' returned exit code '127', the output was ''
-
I also looked at the DNS Resolver logs and found these suspicious entries:
Apr 8 09:05:10 unbound 86407 [86407:0] info: service stopped (unbound 1.18.0). Apr 8 09:05:10 unbound 86407 [86407:0] info: server stats for thread 0: 1081 queries, 326 answers from cache, 755 recursions, 0 prefetch, 0 rejected by ip ratelimiting Apr 8 09:05:10 unbound 86407 [86407:0] info: server stats for thread 0: requestlist max 6 avg 1.49934 exceeded 0 jostled 0 Apr 8 09:05:10 unbound 86407 [86407:0] info: average recursion processing time 69.285267 sec Apr 8 09:05:10 unbound 86407 [86407:0] info: histogram of recursion processing times Apr 8 09:05:10 unbound 86407 [86407:0] info: [25%]=5.55147e-07 median[50%]=16.4444 [75%]=101.3 Apr 8 09:05:10 unbound 86407 [86407:0] info: lower(secs) upper(secs) recursions Apr 8 09:05:10 unbound 86407 [86407:0] info: 0.000000 0.000001 340 Apr 8 09:05:10 unbound 86407 [86407:0] info: 0.000256 0.000512 1 Apr 8 09:05:10 unbound 86407 [86407:0] info: 0.008192 0.016384 2
I cannot be sure, however, but recursions that are not terminated can cause a system hang. Is this normal?
I looked at my own appliance’s unbound log, which has recursions that last a small fraction of the time those of my client take:
Mar 6 22:22:18 unbound 17207 [17207:0] info: 0.131072 0.262144 8 Mar 6 22:22:18 unbound 17207 [17207:0] info: server stats for thread 1: 229 queries, 72 answers from cache, 157 recursions, 0 prefetch, 0 rejected by ip ratelimiting Mar 6 22:22:18 unbound 17207 [17207:0] info: server stats for thread 1: requestlist max 5 avg 0.55414 exceeded 0 jostled 0 Mar 6 22:22:18 unbound 17207 [17207:0] info: average recursion processing time 0.039812 sec Mar 6 22:22:18 unbound 17207 [17207:0] info: histogram of recursion processing times Mar 6 22:22:18 unbound 17207 [17207:0] info: [25%]=0.00949527 median[50%]=0.0224619 [75%]=0.052992 Mar 6 22:22:18 unbound 17207 [17207:0] info: lower(secs) upper(secs) recursions Mar 6 22:22:18 unbound 17207 [17207:0] info: 0.000000 0.000001 5
-
-
You have an OpenVPN gateway configured on the system?
Make sure the system default gateway in Sys > Routing > Gateways is set to WAN_DHCP specifically and not automatic if so. Otherwise if the WAN gateway ever goes down the system may try to switch to a different gateway and choose the OpenVPN gateway which is probably invalid. It will not switch back unless that too goes down or the system is rebooted.
Steve
-
@stephenw10: Thanks very much! Excellent point!