Service Watchdog spamming



  • Service Watchdog keeps spamming me with emails (roughly every 30 seconds) saying that services have been detected as stopped and are being restarted.

    code
    10:05:00 Service Watchdog detected service avahi stopped. Restarting avahi (Avahi Zeroconf/mDNS Daemon)
    10:09:00 Service Watchdog detected service avahi stopped. Restarting avahi (Avahi Zeroconf/mDNS Daemon)
    10:13:00 Service Watchdog detected service dhcpd stopped. Restarting dhcpd (DHCP Service)

    This is just a small sample. Also getting similar messages for unbound and OpenVPN.

    I'm unable to connect to the pfSense box via VPN. I can sometimes get the login page to load, but I can't stay connected long enough for the login to complete, and I'm not at home right now. The same thing happened at roughly the same time yesterday morning. I was able to stop it by rebooting the box after I got home from work.

    The only recent change I have made to pfSense was installing Service Watchdog. Is it possible that this package is causing this behavior?
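To see which services are flapping and how often, the notification lines can be tallied per service. This is a minimal Python sketch that assumes every notification follows the wording of the three sample lines above; `count_restarts` is a hypothetical helper, not part of pfSense or the Service Watchdog package.

```python
import re
from collections import Counter

# Matches the wording of the sample notifications quoted in this post.
NOTICE_RE = re.compile(r"Service Watchdog detected service (?P<service>\S+) stopped")


def count_restarts(lines):
    """Count Service Watchdog restart notifications per service name."""
    counts = Counter()
    for line in lines:
        match = NOTICE_RE.search(line)
        if match:
            counts[match["service"]] += 1
    return counts


sample = [
    "10:05:00 Service Watchdog detected service avahi stopped. Restarting avahi (Avahi Zeroconf/mDNS Daemon)",
    "10:09:00 Service Watchdog detected service avahi stopped. Restarting avahi (Avahi Zeroconf/mDNS Daemon)",
    "10:13:00 Service Watchdog detected service dhcpd stopped. Restarting dhcpd (DHCP Service)",
]

for service, count in count_restarts(sample).most_common():
    print(service, count)
```

If several unrelated services show up in the tally at the same time, that points toward a shared cause rather than one misbehaving daemon.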


  • Rebel Alliance Global Moderator

    I have Service Watchdog installed, monitoring 4 services: unbound, freerad, and 2 of my OpenVPN instances. I'm not seeing such behavior.



  • Maybe I’ll try uninstalling and reinstalling? Not sure what else to try. It looks like Watchdog is constantly trying to restart the other services and can’t.


  • Rebel Alliance Global Moderator

    Service Watchdog will only monitor what you set it to monitor. If you're not running those services (if you have them disabled, for example), then they shouldn't be listed in your watchdog setup.



  • They should all be running. None are disabled, but it looks as if Watchdog can’t restart them or perhaps they keep crashing?


  • Rebel Alliance Developer Netgate

    Odds are this has nothing to do with Service Watchdog. Something is causing those processes to restart or die. You need to find out what that something is, and it wouldn't be the Service Watchdog package. Check your logs for more info. Could be an unstable or flapping WAN/interface for example.



  • @jimp said in Service Watchdog spamming:

    Odds are this has nothing to do with Service Watchdog. Something is causing those processes to restart or die. You need to find out what that something is, and it wouldn't be the Service Watchdog package. Check your logs for more info. Could be an unstable or flapping WAN/interface for example.

    Thanks. I saw in the gateway logs for yesterday's event that it all seemed to start with a high latency alarm, but I'm not sure how to proceed from there to resolve this. Rebooting the router cleared the problem for a while (18 hours or so), but it has recurred today, so it is obviously not fixed.



  • This is a small clip from my gateway log at the time that Watchdog started spewing emails:

    code
    Jun 14 09:24:50	dpinger		WAN_DHCP 8.8.8.8: Alarm latency 17458us stddev 2171us loss 41%
    Jun 14 09:24:37	dpinger		send_interval 1000ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 600ms loss_alarm 40% dest_addr 8.8.8.8 bind_addr 76.182.14.197 identifier "WAN_DHCP "
    Jun 14 09:24:21	dpinger		send_interval 1000ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 600ms loss_alarm 40% dest_addr 8.8.8.8 bind_addr 76.182.14.197 identifier "WAN_DHCP "
    Jun 14 09:24:19	dpinger		WAN_DHCP 8.8.8.8: Alarm latency 17361us stddev 4058us loss 45%
    Jun 14 09:24:07	dpinger		send_interval 1000ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 600ms loss_alarm 40% dest_addr 8.8.8.8 bind_addr 76.182.14.197 identifier "WAN_DHCP "
    Jun 14 09:23:58	dpinger		WAN_DHCP6 2620:0:ccc::2: Alarm latency 0us stddev 0us loss 100%
    Jun 14 09:23:56	dpinger		send_interval 1000ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 600ms loss_alarm 40% dest_addr 2620:0:ccc::2 bind_addr 2606:a000:bfc0:c9:29f8:1706:da34:1702 identifier "WAN_DHCP6 "
    Jun 14 09:23:49	dpinger		send_interval 1000ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 600ms loss_alarm 40% dest_addr 8.8.8.8 bind_addr 76.182.14.197 identifier "WAN_DHCP "
    Jun 14 09:23:41	dpinger		WAN_DHCP6 2620:0:ccc::2: Alarm latency 0us stddev 0us loss 100%
    Jun 14 09:23:39	dpinger		send_interval 1000ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 600ms loss_alarm 40% dest_addr 2620:0:ccc::2 bind_addr 2606:a000:bfc0:c9:29f8:1706:da34:1702 identifier "WAN_DHCP6 "
    Jun 14 09:23:32	dpinger		send_interval 1000ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 600ms loss_alarm 40% dest_addr 8.8.8.8 bind_addr 76.182.14.197 identifier "WAN_DHCP "
    Jun 14 09:23:19	dpinger		send_interval 1000ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 600ms loss_alarm 40% dest_addr 8.8.8.8 bind_addr 76.182.14.197 identifier "WAN_DHCP "
    Jun 14 09:23:11	dpinger		WAN_DHCP6 2620:0:ccc::2: Alarm latency 0us stddev 0us loss 100%
    Jun 14 09:23:09	dpinger		send_interval 1000ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 600ms loss_alarm 40% dest_addr 2620:0:ccc::2 bind_addr 2606:a000:bfc0:c9:29f8:1706:da34:1702 identifier "WAN_DHCP6 "
    Jun 14 09:23:01	dpinger		send_interval 1000ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 600ms loss_alarm 40% dest_addr 8.8.8.8 bind_addr 76.182.14.197 identifier "WAN_DHCP "
    Jun 14 09:22:54	dpinger		WAN_DHCP6 2620:0:ccc::2: Alarm latency 0us stddev 0us loss 100%
    Jun 14 09:22:52	dpinger		send_interval 1000ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 600ms loss_alarm 40% dest_addr 2620:0:ccc::2 bind_addr 2606:a000:bfc0:c9:29f8:1706:da34:1702 identifier "WAN_DHCP6 "
    Jun 14 09:22:44	dpinger		send_interval 1000ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 600ms loss_alarm 40% dest_addr 8.8.8.8 bind_addr 76.182.14.197 identifier "WAN_DHCP "
    Jun 14 09:22:29	dpinger		send_interval 1000ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 600ms loss_alarm 40% dest_addr 8.8.8.8 bind_addr 76.182.14.197 identifier "WAN_DHCP "
    Jun 14 09:22:07	dpinger		send_interval 1000ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 600ms loss_alarm 40% dest_addr 8.8.8.8 bind_addr 76.182.14.197 identifier "WAN_DHCP "
    Jun 14 09:22:03	dpinger		WAN_DHCP 8.8.8.8: Alarm latency 17380us stddev 2134us loss 44%
    Jun 14 09:21:53	dpinger		send_interval 1000ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 600ms loss_alarm 40% dest_addr 8.8.8.8 bind_addr 76.182.14.197 identifier "WAN_DHCP "
    Jun 14 09:21:36	dpinger		send_interval 1000ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 600ms loss_alarm 40% dest_addr 8.8.8.8 bind_addr 76.182.14.197 identifier "WAN_DHCP "
    Jun 14 09:21:20	dpinger		send_interval 1000ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 600ms loss_alarm 40% dest_addr 8.8.8.8 bind_addr 76.182.14.197 identifier "WAN_DHCP "
    Jun 14 09:21:12	dpinger		WAN_DHCP6 2620:0:ccc::2: Alarm latency 0us stddev 0us loss 100%
    Jun 14 09:21:10	dpinger		send_interval 1000ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 600ms loss_alarm 40% dest_addr 2620:0:ccc::2 bind_addr 2606:a000:bfc0:c9:29f8:1706:da34:1702 identifier "WAN_DHCP6 "
    Jun 14 09:21:02	dpinger		send_interval 1000ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 600ms loss_alarm 40% dest_addr 8.8.8.8 bind_addr 76.182.14.197 identifier "WAN_DHCP "
    Jun 14 09:20:56	dpinger		WAN_DHCP6 2620:0:ccc::2: Alarm latency 35689us stddev 0us loss 50%
    Jun 14 09:20:53	dpinger		send_interval 1000ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 600ms loss_alarm 40% dest_addr 2620:0:ccc::2 bind_addr 2606:a000:bfc0:c9:29f8:1706:da34:1702 identifier "WAN_DHCP6 "
    Jun 14 09:20:45	dpinger		send_interval 1000ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 600ms loss_alarm 40% dest_addr 8.8.8.8 bind_addr 76.182.14.197 identifier "WAN_DHCP "
    

    I'm not sure where to start troubleshooting this, though. Is it a problem with pfSense or with my ISP? Rebooting the pfSense router has stopped the emails for now. Packet loss is showing 0%.
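One way to read the log above is to compare each alarm line against the thresholds that dpinger prints at startup (`latency_alarm 600ms`, `loss_alarm 40%`). The following is a minimal Python sketch, assuming an alarm is raised when a value exceeds its threshold; `explain_alarm` and the constant names are hypothetical and not part of dpinger or pfSense.

```python
import re

# Alarm lines as they appear in the gateway log above.
ALARM_RE = re.compile(
    r"dpinger\s+(?P<gw>\S+) (?P<addr>\S+): Alarm "
    r"latency (?P<latency_us>\d+)us stddev (?P<stddev_us>\d+)us loss (?P<loss>\d+)%"
)

# Thresholds copied from the dpinger startup lines in the same log.
LATENCY_ALARM_MS = 600
LOSS_ALARM_PCT = 40


def explain_alarm(line):
    """Return (gateway, reasons) for an alarm line, or None for other lines."""
    match = ALARM_RE.search(line)
    if match is None:
        return None
    latency_ms = int(match["latency_us"]) / 1000  # log reports microseconds
    loss = int(match["loss"])
    reasons = []
    # Assumption: an alarm fires when a value exceeds its threshold.
    if latency_ms > LATENCY_ALARM_MS:
        reasons.append(f"latency {latency_ms:.1f}ms")
    if loss > LOSS_ALARM_PCT:
        reasons.append(f"loss {loss}%")
    return match["gw"], reasons


sample = [
    "Jun 14 09:24:50\tdpinger\t\tWAN_DHCP 8.8.8.8: Alarm latency 17458us stddev 2171us loss 41%",
    "Jun 14 09:23:58\tdpinger\t\tWAN_DHCP6 2620:0:ccc::2: Alarm latency 0us stddev 0us loss 100%",
]

for line in sample:
    print(explain_alarm(line))
```

For the lines in this clip, latency stays around 17 ms, far below the 600 ms threshold, so under that assumption the alarms come from packet loss rather than latency.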



  • Is it possible that this issue could be hardware related?

    I noticed yesterday when I was manually rebooting the router that it was fairly hot. I've never noticed this before, but I could have missed it, I guess. Since I have the thermostat in my house set to raise the room temp to 85F when no one is home, the timing of these occurrences (it started again about 10am today) would coincide with the higher room temperatures.

    I'm not sure whether my theory is really possible. The dashboard doesn't display any abnormal CPU temps after reboot, but the hot case (it even smells hot) doesn't make much sense when CPU loads are very low.

    I'm thinking about migrating to an SG-3100, but I would really like to confirm the hardware problem first. Does anyone have any ideas?


  • Rebel Alliance Developer Netgate

    It's possible the high temperatures are causing some piece of gear (be it your firewall hardware, your ISP CPE, or maybe a switch) to flake out and take errors/drop link, but you would have to isolate and test to find out.

    If the temp swings are rapid and/or the humidity gets high, you could also be dealing with condensation.