UNBOUND error: bind: address already in use fatal error: could not open ports

Mr. Jingles

G'evening,

I googled it, but it seems it's a 2011/2012 problem (?)


Jan 29 18:10:00  php-cgi  servicewatchdog_cron.php: Service Watchdog detected service unbound stopped. Restarting unbound (Unbound DNS Resolver)        
Jan 29 18:09:40  php-fpm /services_dhcp.php: The command '/usr/local/sbin/unbound -c /var/unbound/unbound.conf' returned exit code '1', the output was '[1485709780] unbound[38648:0] error: bind: address already in use [1485709780] unbound[38648:0] fatal error: could not open ports'

I'm on 2.3.2-p1.

I already rebooted, didn't solve it.

doktornotor

Hint: Do NOT ever add Unbound to Service Watchdog. Especially not if using pfBlockerNG.

Mr. Jingles

Good to see you're back, Dok ;D

Ok, I'll remove it from serwadog. Strange, that a wadog would cause this, but what the bleep do I know :P

doktornotor

When Unbound restarts on its own (DHCP registration, pfBNG…), the watchdog cronjob often spots this and tries to restart it on its own as well. Trying to start it twice at the same time will fail. Really no good.

Mr. Jingles

You were right, Dok, thank you. Removing it = no unbound crashes in 24 hours.

luckman212

I'm on 2.4 snapshots (2.4.0.b.20170209.0450 to be exact) and I am NOT using Service Watchdog. Don't even have it installed.

I'm getting this error too from rc.newwanip about not being able to start Unbound. It happens after WAN flaps. There's some kind of deadlock or race condition going on. If I just wait a few seconds, and then manually start the service from the dash it comes up. Strange timing issue. But Unbound is such a critical part of the system, it really does need a watchdog.

pfcode

@luckman212:

I'm on 2.4 snapshots (2.4.0.b.20170209.0450 to be exact) and I am NOT using Service Watchdog. Don't even have it installed.

I'm getting this error too from rc.newwanip about not being able to start Unbound. It happens after WAN flaps. There's some kind of deadlock or race condition going on. If I just wait a few seconds, and then manually start the service from the dash it comes up. Strange timing issue. But Unbound is such a critical part of the system, it really does need a watchdog.

if you have pfBlockerNG DNSBL enabled, then disabling it will fix.

luckman212

@pfcode:

if you have pfBlockerNG DNSBL enabled, then disabling it will fix.

But, I don't have pfBNG installed.

SteveITS

luckman212, did you ever find a fix for this issue of unbound not restarting successfully? We've been running into the same thing on a HA pair where both are acting as DNS servers. When the daily Suricata rules update and hence sync happens, the packages are restarted on the backup router and it logs:

/xmlrpc.php: The command '/usr/local/sbin/unbound -c /var/unbound/unbound.conf' returned exit code '1', the output was '[1493810636] unbound[78923:0] error: bind: address already in use [1493810636] unbound[78923:0] fatal error: could not open ports'

I think this is relatively new (we have 2.3.3_1 currently) but I'm not quite sure when it started. I know it occurred at least once before installing pfBlockerNG, and disabling pfBlockerNG on both routers or even uninstalling pfBlockerNG had no effect.

The weird part is I can go to the Suricata sync or the HA sync page and save, which seems to trigger a sync, without issue. And rule updates that sync don't cause this issue.

BBcan177

Try the following patch:
https://redmine.pfsense.org/issues/7326#note-2

SteveITS

Thanks for the pointer. I forgot this forum doesn't automatically subscribe you to threads when you reply, so it took a day to remember to check for a reply.

I had found that bug report also but was hesitant to start modifying code. I did so and overnight our backup router logged 18 "waiting 1 second" messages over a 24 second span, but successfully started unbound. It seems weird that this only affects our backup router (I didn't even put the fix into our master), and I'm fairly confident it hasn't been an issue for the entire time we've been using Suricata…since last fall or maybe summer, without looking.

It doesn't appear your fix/workaround made it in today's 2.3.4 release.

PS - thanks for pfBlockerNG. :)

Edit: curiously, every day since then the router's logged only one "waiting 1 second" message logged. Still, it's apparently 1 second longer than the "kill" assumes.

SteveITS

For posterity, it looks like the last patch in that redmine bug made it into 2.3.4_1.

My last post said our router only logged one "waiting" message but since then I've seen days where it logs a dozen, give or take, so it varies due to something. I think the 30 second loop should be long enough though.