Unbound problems after update to latest beta



  • Anybody else noticing problems with unbound, probably after those commits:

    https://github.com/pfsense/pfsense/commit/38d110824c87ff60c6289c0432d55009586ceee4
    https://github.com/pfsense/pfsense/commit/8a0aa42c197361ebb82387e5bdc8378e5440837f

    Rebooting takes longer than before and takes ages to resume after showing "Setting up gateway monitors…done."
    Starting unbound also takes a really long time now; stopping or restarting it is fast.

    Running the commands below by hand is fast, but unbound also seems to have stopped logging directly after the reboot.
    It started logging again after the first time I started it by hand, but it did not log the initial stop or start after the reboot.

    
    [2.4.0-BETA][root@vpn2]/root: ps aux | grep unb
    unbound 47812   0.0  2.3  68700 22708  -  Is   10:40    0:00.23 /usr/local/sbin/unbound -c /var/unbound/unbound.conf
    root    36828   0.0  0.2  14700  2404  0  S+   10:44    0:00.00 grep unb
    [2.4.0-BETA][root@vpn2]/root: /usr/local/sbin/unbound-control -c /var/unbound/unbound.conf stop
    ok
    [2.4.0-BETA][root@vpn2]/root: /usr/local/sbin/unbound -c /var/unbound/unbound.conf
    [2.4.0-BETA][root@vpn2]/root: ps aux | grep unb
    unbound 37895   2.0  2.2  66652 22308  -  Ss   10:44    0:00.13 /usr/local/sbin/unbound -c /var/unbound/unbound.conf
    root    38271   0.0  0.2  14700  2404  0  S+   10:44    0:00.00 grep unb
    [2.4.0-BETA][root@vpn2]/root: /usr/local/sbin/unbound-control -c /var/unbound/unbound.conf stop
    ok
    [2.4.0-BETA][root@vpn2]/root: /usr/local/sbin/unbound -c /var/unbound/unbound.conf
    [2.4.0-BETA][root@vpn2]/root:
    


  • +1 on slow reboot and overall speed of configuring unbound…


  • Rebel Alliance Developer Netgate

    The main difference is that it is not stomping all over itself to restart unbound now. It's possible it was fast before because at some point it was actually failing to restart and then something came along later in the boot process and kicked it again.

    Does it actually improve if you back out just https://github.com/pfsense/pfsense/commit/38d110824c87ff60c6289c0432d55009586ceee4 ? Or do you also have to back out https://github.com/pfsense/pfsense/commit/8a0aa42c197361ebb82387e5bdc8378e5440837f to make it fast again?

    I wouldn't be surprised to find out that https://github.com/pfsense/pfsense/commit/8a0aa42c197361ebb82387e5bdc8378e5440837f had a larger impact, because before that commit any calls to unbound-control were improperly formed and thus did nothing. Now they actually do something.

    Drop a note with feedback on https://redmine.pfsense.org/issues/7326 so others following the original issue can see if they can confirm the problem and/or fix.


  • Banned

    Also having unbound issues after update


  • Banned

    Any way to roll back to old version for the meantime?


  • Banned

    After rebooting into the new build I can't access the webGUI until I restart the webConfigurator from SSH. Even then it works only by IP; the hostname won't work.

    When I do get in, the dhcpd, ntpd, suricata, bandwidthd, and dnsbl services are all down. Also, no VPN client will connect.

    My setup is effectively broken after this update.


  • Rebel Alliance Developer Netgate

    That does not sound related. Start your own thread and post as much detail (console output, system log contents, etc) as possible.

    If you suspect these changes are related, use the System Patches package to revert those commits.


  • Banned

    How do I revert to an old commit with System Patches?

    OK, I had to work around a few things to get the patch fetch to resolve, but I reverted those two and all is well now.
    https://forum.pfsense.org/index.php?topic=133164.msg731939#msg731939

    At least for me the latest snapshot for Unbound takes down my entire network.

    Thanks for the help!



  • @jimp:

    Does it actually improve if you back out just https://github.com/pfsense/pfsense/commit/38d110824c87ff60c6289c0432d55009586ceee4 ? Or do you also have to back out https://github.com/pfsense/pfsense/commit/8a0aa42c197361ebb82387e5bdc8378e5440837f to make it fast again?

    I just gave it a quick test on my secondary HA gateway. Backing out the first patch was sufficient: unbound starts fast at boot, and stops and restarts fine again.
    Anything else I should test?



  • Yeah same here.
    First one is enough.


  • Rebel Alliance Developer Netgate

    OK, I backed out that commit, so the next new snapshots should be OK again. We'll have to keep looking for a fix for https://redmine.pfsense.org/issues/7326. That issue may be hitting a few users, but this change seems to negatively impact more.

    Those of you who had problems with unbound starting/stopping properly, what services are enabled on these firewalls? And what packages are installed?



  • @jimp:

    Those of you who had problems with unbound starting/stopping properly, what services are enabled on these firewalls? And what packages are installed?

    Thanks!

    Services:

    dhcpd
    dpinger
    ftp-proxy
    ntpd
    openvpn
    sshd
    syslogd
    unbound
    zabbix_agentd_lts

    Packages:

    FTP_Client_Proxy 0.3_3  
    System_Patches 1.1.6_1
    zabbix-agent 0.8.9_3


  • Rebel Alliance Developer Netgate

    Could everyone who had problems running unbound with the previous commit please try the attached patch?

    It doesn't use unbound-control to stop unbound; it just uses kill as before, plus a delay loop to be certain unbound has stopped before moving on.

    It is also available at https://redmine.pfsense.org/issues/7326#note-21

    Use the system patches package to apply the change.

    unbound-stop.diff.txt



  • Hmm, how would I use this with the System Patches Package?

    /usr/bin/patch --directory=/ -t -p1 -i /var/patches/596064c796ffb.patch --check --forward --ignore-whitespace
    
    Hmm...  Looks like a unified diff to me...
    The text leading up to this was:
    --------------------------
    |diff --git a/src/etc/inc/services.inc b/src/etc/inc/services.inc
    |index ffc4aa8..0e2cfad 100644
    |--- a/src/etc/inc/services.inc
    |+++ b/src/etc/inc/services.inc
    --------------------------
    No file to patch.  Skipping...
    Hunk #1 ignored at 2235.
    Hunk #2 ignored at 2261.
    2 out of 2 hunks ignored while patching src/etc/inc/services.inc
    done
    

    Edit: Never mind, I removed "/src" instances and it worked.

    Looks better with this patch, everything I tested seems to be as fast as always.


  • Rebel Alliance Developer Netgate

    You can set the Path Strip to 2 in the system patches package entry and then you don't have to edit anything.
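
For anyone wondering why a Path Strip of 2 helps here: the following is a minimal sketch, assuming the usual `patch -pN` behaviour of dropping the first N slash-separated components from each file name in the diff. `strip_path` is a hypothetical helper written only for this illustration; it is not part of pfSense or the System Patches package.

```shell
#!/bin/sh
# Hypothetical helper: mimic how patch -pN trims leading path components.
# Usage: strip_path N PATH
strip_path() {
    n=$1
    p=$2
    while [ "$n" -gt 0 ]; do
        case $p in
            */*) p=${p#*/} ;;   # drop everything up to and including the first slash
            *)   break ;;       # no slash left, nothing more to strip
        esac
        n=$((n - 1))
    done
    printf '%s\n' "$p"
}

# The diff names the file a/src/etc/inc/services.inc and patch runs with --directory=/
strip_path 1 a/src/etc/inc/services.inc   # prints src/etc/inc/services.inc
strip_path 2 a/src/etc/inc/services.inc   # prints etc/inc/services.inc
```

With a strip count of 1 the resulting path still starts with `src/`, which matches the "No file to patch" output quoted earlier; with 2 it resolves relative to `/`, which has the same effect as removing the "/src" parts by hand.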

    Once a couple more people try it out I'll commit that change, probably Monday.



  • Just a newb question here, but after the change is committed an upgrade will implement it, correct?  Just making sure I won't still have to implement the patch at that point if I'm not a fresh install.  I'm currently not experiencing any issues but thought I'd ask.  Thanks!


  • Rebel Alliance Developer Netgate

    @nmiller0113:

    Just a newb question here, but after the change is committed an upgrade will implement it, correct?  Just making sure I won't still have to implement the patch at that point if I'm not a fresh install.  I'm currently not experiencing any issues but thought I'd ask.  Thanks!

    Correct. If the patch gets committed, it will be included in whatever the next snapshot is after it gets committed.

    How fast it gets committed depends entirely on people who experienced the previous issues testing it and providing feedback, though.



  • Jim's first reply is correct, I was the one who did the original bug report on this.

    Previously unbound was quick on boot because it didn't actually check whether it had started OK. It simply ran a kill command immediately followed by a start command (this was actually done in the WAN IP change process). The problem was that if the kill had not finished, the start command would fail, but the pfSense boot scripts were unaware of this.

    My original proposed fix was to issue a shutdown command to unbound to stop it, which can take time to complete, especially when there is a large DNSBL list. Hence the delay on boot and restarts.

    Better to have the delay than a service that hasn't started.

    bbcan177, the developer of pfBlockerNG, also confirmed the bug.


  • Rebel Alliance Developer Netgate

    @chrcoluk:

    Better to have the delay than a service that hasn't started.

    That was my thought as well, except in practice this led to people not having any running unbound instance, or one left running but in a weird/broken state (e.g. stuck at 100% CPU).

    Unbound's official docs actually say to stop it with kill, which is still what my last patch does, but it also waits ~30 to let it stop nicely before attempting to start it again.
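
The stop-then-wait idea described above can be sketched roughly as follows. This is an illustration in plain sh, not the code from the patch. `stop_and_wait`, the one-second poll interval, and the default limit of 30 polls are all assumptions chosen for the example (the post only says "~30").

```shell
#!/bin/sh
# Hypothetical sketch: send SIGTERM, then poll until the process is gone
# or the retry budget runs out. Returns 0 if it stopped, 1 if it is still alive.
# Usage: stop_and_wait PID [MAX_POLLS]
stop_and_wait() {
    pid=$1
    tries=${2:-30}                               # assumed limit: 30 polls, one second apart
    # If the signal cannot be delivered (typically: no such process),
    # there is nothing to wait for.
    kill -TERM "$pid" 2>/dev/null || return 0
    while [ "$tries" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0   # signal 0 only tests whether the PID exists
        sleep 1
        tries=$((tries - 1))
    done
    return 1                                     # still running after the wait
}
```

A caller would start the new instance only when the function returns 0, for example `stop_and_wait "$old_pid" && /usr/local/sbin/unbound -c /var/unbound/unbound.conf`, where `old_pid` is a placeholder for however the running PID is obtained. That ordering is what avoids the start-before-stop race described earlier in the thread.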



  • Yep, I can see from the posts here that the fix hasn't worked for everyone. Hopefully these guys will test the new patch to confirm it's good for them.


  • Rebel Alliance Developer Netgate

    I hope so, too. I have not received any feedback about the patch I posted here (or on the Redmine ticket). I could not reproduce the problems here, so I'd rather not commit it without verifying it works properly on firewalls that exhibited the previous problems.



  • @jimp:

    I hope so, too. I have not received any feedback about the patch I posted here (or on the Redmine ticket). I could not reproduce the problems here, so I'd rather not commit it without verifying it works properly on firewalls that exhibited the previous problems.

    Hmm, in reply #13 I gave feedback, I believe. Works fine, I have not upgraded my installation since.


  • Rebel Alliance Developer Netgate

    @athurdent:

    @jimp:

    I hope so, too. I have not received any feedback about the patch I posted here (or on the Redmine ticket). I could not reproduce the problems here, so I'd rather not commit it without verifying it works properly on firewalls that exhibited the previous problems.

    Hmm, in reply #13 I gave feedback, I believe. Works fine, I have not upgraded my installation since.

    I must have missed the "edit:". I do not get notified on post edits like I do for replies, so it's not a very effective way to convey an update like that.



  • Ah, sorry, I'll post a second answer for things like this in the future. But it would be great if the others in this thread could give it a shot, too.


  • Rebel Alliance Developer Netgate

    I went ahead and committed the updated patch, we'll see how that goes.



  • I can't say whether the issue I'm experiencing is related to this one, but for a while now I have consistently had to restart unbound after updating or rebooting pfSense, or else nslookup and dig fail. This happens only on the snapshot, not on 2.3.4. I am using the "do not wait for RA" feature. Otherwise my system is quite simple and uses defaults almost exclusively.

    Here are the details: https://forum.pfsense.org/index.php?topic=132181.0



  • I'm not sure what's going on here, but while swapping modems this afternoon (upgraded to a DOCSIS 3.1 modem), I had my pfSense WAN interface not connected for a bit. After reconnecting it, and just now looking in the DNS Resolver logs, I have extended the log view up to 5000 lines and over 95% of the lines are…

    Jul 18 17:30:46	unbound	82208:1	error: can't bind socket: Can't assign requested address for x.x.x.x
    

    or the same line with an IPv6 address in place of the IPv4 one. The timestamps are identical to the minute, with a variance of 3 seconds over more than 4900 lines. That's some MAJOR log spamming going on while the connection is down.

    Maybe it's always done that and I just haven't noticed (I don't tend to look at dns resolver logs often)… but that's pretty severe to be writing over 4900 lines to a log file in just three seconds. Possibly related? If not, please feel free to split this to a new topic.
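
To put a number on a burst like this, one could count the matching lines per timestamp. A minimal sketch, assuming syslog-style lines like the one quoted above arrive on standard input; `count_bind_errors` is a hypothetical helper, and how to dump the resolver log to plain text on a given pfSense version is deliberately left out.

```shell
#!/bin/sh
# Hypothetical helper: count "can't bind socket" errors per second.
# Reads log lines on stdin; prints one "<count> <month> <day> <time>" row
# per distinct timestamp (rows are sorted as text, not chronologically).
count_bind_errors() {
    grep "can't bind socket" |
        awk '{ print $1, $2, $3 }' |   # keep only month, day and HH:MM:SS
        sort | uniq -c
}
```

Feeding the log text through `count_bind_errors` gives one row per second, which makes it easy to see whether thousands of lines really landed within a few seconds.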