How can I troubleshot these log messages

guardian

Once or twice/day the service watchdog is reporting an unbound error. How can I find out what is causing this?
(It appears to happen about 2 minutes after pfBlocker updates - likely during a reload.

Furthermore the notify is reporting that it can't send an error message. It should be noted that the log entries below were from an email sent to the address (redacted) below, by the mail server (redacted) below. I'm wondering if this is due to the DNS/unbound issues or if the Service Watchdog has a separate email configuration that I have not set up properly.

Apr 18 09:15:00 pfsense php[72622]: [pfBlockerNG] Starting cron process.
Apr 18 09:17:00 pfsense php[16062]: servicewatchdog_cron.php: Service Watchdog detected service unbound stopped. Restarting unbound (DNS Resolver)
Apr 18 09:17:03 pfsense php[16062]: servicewatchdog_cron.php: The command '/usr/local/sbin/unbound -c /var/unbound/unbound.conf' returned exit code '1', the output was '[1681823823] unbound[26187:0] error: bind: address already in use [1681823823] unbound[26187:0] fatal error: could not open ports'
Apr 18 09:17:04 pfsense php[16062]: servicewatchdog_cron.php: The command '/usr/sbin/arp -s '172.16.50.1' 'xx:xx:xx:xx:xx:xx'' returned exit code '1', the output was 'arp: writing to routing socket: Operation not permitted'
Apr 18 09:17:37 pfsense php[17041]: notify_monitor.php: Could not send the message to --------@---------- -- Error: Failed to connect to ssl://------.--------.---:465 [SMTP: Failed to connect socket: php_network_getaddresses: getaddrinfo failed: Name does not resolve (code: -1, response: )]
Apr 18 09:19:09 pfsense php[72622]: [pfBlockerNG] No changes to Firewall rules, skipping Filter Reload
----------------------------------------
Apr 19 03:15:00 pfsense php[66287]: [pfBlockerNG] Starting cron process.
Apr 19 03:17:00 pfsense php[8612]: servicewatchdog_cron.php: Service Watchdog detected service unbound stopped. Restarting unbound (DNS Resolver)
Apr 19 03:17:25 pfsense php[8612]: servicewatchdog_cron.php: The command '/usr/local/sbin/unbound -c /var/unbound/unbound.conf' returned exit code '1', the output was '[1681888645] unbound[18841:0] error: bind: address already in use [1681888645] unbound[18841:0] fatal error: could not open ports'
Apr 19 03:17:26 pfsense php[8612]: servicewatchdog_cron.php: The command '/usr/sbin/arp -s '172.16.50.1' 'xx:xx:xx:xx:xx:xx'' returned exit code '1', the output was 'arp: writing to routing socket: Operation not permitted'
Apr 19 03:17:38 pfsense php[9203]: notify_monitor.php: Could not send the message to --------@---------- -- Error: Failed to connect to ssl://------.--------.---:465 [SMTP: Failed to connect socket: php_network_getaddresses: getaddrinfo failed: Name does not resolve (code: -1, response: )]
Apr 19 03:19:00 pfsense php[66287]: [pfBlockerNG] No changes to Firewall rules, skipping Filter Reload

SteveITS

@guardian said in How can I troubleshot these log messages:

unbound[26187:0] error: bind: address already in use

This would normally mean it's already running. Most likely watchdog is seeing pfB stop it, and so tries to (also) start it up again. I would suggest not having watchdog watch unbound.

guardian

@steveits said in How can I troubleshot these log messages:

@guardian said in How can I troubleshot these log messages:

unbound[26187:0] error: bind: address already in use

This would normally mean it's already running. Most likely watchdog is seeing pfB stop it, and so tries to (also) start it up again. I would suggest not having watchdog watch unbound.

Thanks for the quick reply @steveits - I put the watchdog in because I was having trouble with unbound hanging. So far I have noticed any problems - only reason that I know this was happening is because of the log messages.

Any way to delay the watchdog so it doesn't kick in for 5 minutes? That should be enough time for pfBlocker update to finish.

Do you think the "Could not send the message" error is because of the DNS issue, or should I be looking further?

SteveITS

@guardian I have not used watchdog. I have read several posts about it interfering with normal Suricata/Snort restarts, resulting in multiple processes running.

"Name does not resolve" does sound like a DNS issue.

Unbound does not normally crash/stop. I would fix that issue rather than use watchdog.

It looks like pfBlocker takes 4+ minutes to update...? That seems long but I have seen very large lists take 10+ minutes on ARM hardware. I thought it only restarted unbound at the end of the update, though I didn't double check that.

Gertjan

@guardian said in How can I troubleshot these log messages:

Once or twice/day the service watchdog is reporting an unbound error. How can I find out what is causing this?
(It appears to happen about 2 minutes after pfBlocker updates - likely during a reload.

Watchdog is stupid.
To make a short story shorter : remove it from your system.
Or forbid it to mess around with the unbound process.

Back to the beginning :
Unbound, if needed, restarts fast. There was no issue.
Then pfBlockerng came along, and we all fed it with big fat dnsbl lists.
Now unbound needs 'som time' time to stop and restart.
It could take minutes :

(I have some small list, just what I need - took 10 seconds or so)

I strongly advice that you do your own pfBlockerng => Force reload and look at what scrolls on the screen.
See my image for what to look for.

Do your own substraction ( Starting Unbound Resolver... completed [ 04/21/23 16:27:59 ]) - (
Saving DNSBL statistics... completed [ 04/21/23 16:27:55 ]) = the time it took.

If it's to long for you, there is one solution : use less dnsbl's.
( or : get that big Intel i9 based pfSense system with loads of memory ;) )

and if it takes a 'long time' "Watchdog " can detect it is 'down' and just restarts it ( because it's dumb ) ... while it is already under control of pfBlockerng, and is restarting, just that it takes sooo long.

Now, two instances are running .... and things go downhill for everybody.

So, again : do yourself a favor : "ditch it".

You don't need watchdog : unbound was just restarting and doing its work.
watchdog doesn't make thing go faster - it's the contrary.

Sorry for being a bit harsh.

@guardian said in How can I troubleshot these log messages:

Any way to delay the watchdog so it doesn't kick in for 5 minutes? That should be enough time for pfBlocker update to finish.

Noop.
It's dumb.
It's a cron timer (install the pfSense cron package and you can see it).
It starts, looks at the list - and for every process that has no 'pid' it fires a 'start'.

@guardian said in How can I troubleshot these log messages:

Do you think the "Could not send the message" error is because of the DNS issue,

Exact
As said :

@guardian said in How can I troubleshot these log messages:

php_network_getaddresses: getaddrinfo failed: Name does not resolve

which stands for : "no DNS - so no IP".

guardian

@gertjan @SteveITS thanks for the reply.

@gertjan said in How can I troubleshot these log messages:

I strongly advice that you do your own pfBlockerng => Force reload and look at what scrolls on the screen.
See my image for what to look for.

Can you please elaborate?

@gertjan said in How can I troubleshot these log messages:

If it's to long for you, there is one solution : use less dnsbl's.
( or : get that big Intel i9 based pfSense system with loads of memory ;) )
and if it takes a 'long time' "Watchdog " can detect it is 'down' and just restarts it ( because it's dumb ) ... while it is already under control of pfBlockerng, and is restarting, just that it takes sooo long.
Now, two instances are running .... and things go downhill for everybody.
So, again : do yourself a favor : "ditch it".
You don't need watchdog : unbound was just restarting and doing its work.
watchdog doesn't make thing go faster - it's the contrary.

@gertjan the reason I put in the watchdog is that I was having trouble with unbound getting messed up (likely by connectivity problems with the cable modem-or maybe something to do with pfBlocker-not sure). I put in watchdog to restart it which fixes the issue when connectivity returns. So far I have had no long term DNS issues, but that could be coincidence. Before I did that my wife couldn't work unless I was around to reset things.

Does this mean that unbound is failing, or that a second copy is being started?

servicewatchdog_cron.php: The command '/usr/local/sbin/unbound -c /var/unbound/unbound.conf' returned exit code '1', the output was '[1681823823] unbound[26187:0] error: bind: address already in use [1681823823] unbound[26187:0] fatal error: could not open ports'

So I need to do something - I could write a little script that runs every few minutes and checks - I just need a way to differentiate when unbound has hard failed. Suggestions?

SteveITS

@guardian it means the address:port unbound wants to use is already being used…I.e. it’s already running.

guardian

@steveits said in How can I troubleshot these log messages:

@guardian it means the address:port unbound wants to use is already being used…I.e. it’s already running.

So is unbound becoming unresponsive because it is loading a massive list, and watchdog thinks it's dead and starting another copy?

Is there any reliable way to tell if unbound is really hung, or if it's just busy reloading?

I occasionally get hangs that are not just reloading, but where unbound doesn't seem to recover, so I need to come up with some reliable mechanism to detect an error and force a clean restart.

stephenw10

@guardian said in How can I troubleshot these log messages:

Is there any reliable way to tell if unbound is really hung, or if it's just busy reloading?

Not really. Since if it takes that long to load the config Unbound really isn't running during that time.

You should not use the Service Watchdog for Unbound.