Unbound stops resolving - only a complete Firewall reboot will fix it



  • Running 2.4.4-RELEASE-p2 (amd64) and using Unbound as my DNS Resolver. Works for a while, then all of the sudden, I'm unable to resolve DNS from any device on my network. Unbound service restart does not fix the problem nor does it throw any errors. If I restart the Firewall, everything is good when it comes back up.

    I also have three OpenVPN client connections. I've been having a problem lately, where my VPN connections go down very briefly. After investigation, looks like this happens when I briefly (for a second or two) lose my Internet connection. It is a Fixed Wireless LTE connection, and the modem seems to disconnect and then reconnect very briefly.

    The Unbound issues seem to coincide time wise with this brief interruption in Internet and OpenVPN connection drops. The OpenVPN connections re-establish quickly, but Unbound stops working.

    Here's what I see in the logs...

    Under Gateways:
    Jan 28 06:27:07 dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 192.168.209.1 bind_addr 192.168.209.13 identifier "XPLORNET1_DHCP "
    Jan 28 06:27:07 dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 10.120.0.5 bind_addr 10.120.0.6 identifier "UKVPN_DHCP "
    Jan 28 06:27:07 dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 10.36.0.77 bind_addr 10.36.0.78 identifier "USVPN_DHCP "
    Jan 28 06:27:07 dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 10.159.0.61 bind_addr 10.159.0.62 identifier "CANVPN_VPNV4 "
    Jan 28 06:27:09 dpinger USVPN_DHCP 10.36.0.77: Alarm latency 0us stddev 0us loss 100%
    Jan 28 06:27:09 dpinger CANVPN_VPNV4 10.159.0.61: Alarm latency 0us stddev 0us loss 100%
    Jan 28 06:27:09 dpinger UKVPN_DHCP 10.120.0.5: Alarm latency 0us stddev 0us loss 100%

    Under DNS Resolver:
    Jan 28 06:27:09 unbound 40968:0 info: server stats for thread 3: requestlist max 0 avg 0 exceeded 0 jostled 0
    Jan 28 06:27:12 unbound 38062:0 notice: init module 0: validator
    Jan 28 06:27:12 unbound 38062:0 notice: init module 1: iterator
    Jan 28 06:27:12 unbound 38062:0 info: start of service (unbound 1.8.1).
    Jan 28 06:27:22 unbound 38062:0 info: generate keytag query _ta-4f66. NULL IN

    Any help or hints would be appreciated.

    Peter



  • When unbound fail it normally output to Status / System / Logs / System /General

    Jan 26 15:46:14 kernel pid 28159 (unbound), uid 59: exited on signal 11



  • I'm not seeing that error, but I do see this one related to unbound. And it's in and around the time DNS started failing.

    Jan 28 05:33:39 php-fpm 443 /rc.newwanip: The command '/usr/local/sbin/unbound -c /var/unbound/unbound.conf' returned exit code '1', the output was '[1548668019] unbound[41947:0] error: can't bind socket: Address already in use for 127.0.0.1 port 953 [1548668019] unbound[41947:0] error: cannot open control interface 127.0.0.1 953 [1548668019] unbound[41947:0] fatal error: could not open ports'
    Jan 28 05:33:40 php-fpm 442 /rc.newwanip: Creating rrd update script
    Jan 28 05:33:40 php-fpm 38948 /rc.newwanip: The command '/usr/local/sbin/unbound -c /var/unbound/unbound.conf' returned exit code '1', the output was '[1548668020] unbound[96004:0] error: can't bind socket: Address already in use for 127.0.0.1 port 953 [1548668020] unbound[96004:0] error: cannot open control interface 127.0.0.1 953 [1548668020] unbound[96004:0] fatal error: could not open ports'
    Jan 28 05:33:41 php-fpm 443 /rc.newwanip: Creating rrd update script

    Do these errors mean anything?

    Peter



  • That when unbound start before the previous unbound process exit. It can not bind the the port and exit.

    You can check if unbound is still running with a shell command :

    ps -axwwl | grep unbound
    


  • RonpfS,

    Not sure how that will help me. When DNS stops resolving for my clients, do I run this script to see if the service is still running? Are you thinking that maybe my Unbound service is being shutdown when my connection fails, but it actually doesn't shutdown fully. Then upon reconnect, it attempts to restart, but fails because it's still kind of running?

    Thanks for your help.

    Peter





  • Raffi,

    Yes, that sounds like my issue exactly. Problem is, if I'm interpreting the Bug thread correctly (#7326), it looks like the patch has been incorporated into the code already, and I'm running 2.4.4-RELEASE-p2 (amd64).

    However, in the thread, there's also a suggestion that the same fix might need to be made to OpenVPN code - "However, it seems that there are other places in the pfSense code (ie: OpenVPN) that also need to be patched for this issue... So maybe this code should be placed elsewhere to cover all other instances of restarting Unbound."

    It isn't obvious that this was ever done, so maybe that's my problem.

    Where do I go from here?

    Thanks,
    Peter



  • I'm assuming the other suggestions in that thread about the watchdog service and pfblocker don't apply? Were you able to run the suggestion above when the issue happens?

    It could be helpful if you can provide more information such as any packages installed. Screenshots of the Resolver settings might help as well.



  • Hi Raffi,

    I am not running watchdog or pfblocker. In fact, I have zero packages installed.

    I am at the mercy of my ISP to run into the situation again and run the script. Seems to be tied to a very brief modem disconnect followed by OpenVPN client disconnects and reconnects. I guess I could simulate this by unplugging my modem for a second or two? I am not at home right now, but I'll try that when I get home.

    I am attaching my Resolver settings.

    Peter

    0_1548775836154_Resolver.jpg
    0_1548775842824_Resolver Advanced.jpg



  • OK. I tried to recreate the problem by cutting my modem connection long enough for the Gateway to go down. Unbound was resolving upon reconnect, however, versus hanging up like it has been. Results of script:

    59 11006 1 0 20 0 48640 24580 kqread Ss - 0:00.40 /usr/local/sbin/unbound -c /var/unbound/unbound.conf
    0 64376 89674 0 24 0 6968 2804 wait S - 0:00.00 sh -c ps -axwwl | grep unbound 2>&1
    0 64760 64376 0 23 0 6564 2456 piperd S - 0:00.00 grep unbound

    Interesting log entries none-the-less:

    General
    Jan 29 20:10:20 check_reload_status Linkup starting e6000sw0port1
    Jan 29 20:10:20 kernel e6000sw0port1: link state changed to DOWN
    Jan 29 20:10:21 check_reload_status Reloading filter
    Jan 29 20:11:35 check_reload_status Linkup starting e6000sw0port1
    Jan 29 20:11:35 kernel e6000sw0port1: link state changed to UP
    Jan 29 20:11:36 check_reload_status Reloading filter
    Jan 29 20:11:41 kernel ovpnc2: link state changed to DOWN
    Jan 29 20:11:41 check_reload_status Reloading filter
    Jan 29 20:11:42 kernel ovpnc2: link state changed to UP
    Jan 29 20:11:42 check_reload_status rc.newwanip starting ovpnc2
    Jan 29 20:11:43 php-fpm 71221 /rc.newwanip: rc.newwanip: Info: starting on ovpnc2.
    Jan 29 20:11:43 php-fpm 71221 /rc.newwanip: rc.newwanip: on (IP address: 10.41.0.126) (interface: USVPN[opt3]) (real interface: ovpnc2).
    Jan 29 20:11:43 php-fpm 71221 /rc.newwanip: IP Address has changed, killing states on former IP Address 10.41.0.42.
    Jan 29 20:11:44 kernel ovpnc3: link state changed to DOWN
    Jan 29 20:11:44 check_reload_status Reloading filter
    Jan 29 20:11:44 kernel ovpnc1: link state changed to DOWN
    Jan 29 20:11:44 check_reload_status Reloading filter
    Jan 29 20:11:45 kernel ovpnc3: link state changed to UP
    Jan 29 20:11:45 check_reload_status rc.newwanip starting ovpnc3
    Jan 29 20:11:45 kernel ovpnc1: link state changed to UP
    Jan 29 20:11:45 check_reload_status rc.newwanip starting ovpnc1
    Jan 29 20:11:45 php-fpm 71221 /rc.newwanip: Gateway, none 'available' for inet6, use the first one configured. ''
    Jan 29 20:11:46 php-fpm 350 /rc.newwanip: rc.newwanip: Info: starting on ovpnc3.
    Jan 29 20:11:46 php-fpm 350 /rc.newwanip: rc.newwanip: on (IP address: 10.159.0.26) (interface: CANVPN[opt4]) (real interface: ovpnc3).
    Jan 29 20:11:46 php-fpm 350 /rc.newwanip: IP Address has changed, killing states on former IP Address 10.159.0.62.
    Jan 29 20:11:46 php-fpm 349 /rc.newwanip: rc.newwanip: Info: starting on ovpnc1.
    Jan 29 20:11:46 php-fpm 349 /rc.newwanip: rc.newwanip: on (IP address: 10.25.0.46) (interface: UKVPN[opt2]) (real interface: ovpnc1).
    Jan 29 20:11:46 php-fpm 349 /rc.newwanip: IP Address has changed, killing states on former IP Address 10.25.0.6.
    Jan 29 20:11:48 php-fpm 350 /rc.newwanip: Gateway, none 'available' for inet6, use the first one configured. ''
    Jan 29 20:11:50 php-fpm 349 /rc.newwanip: Gateway, none 'available' for inet6, use the first one configured. ''
    Jan 29 20:11:50 php-fpm 350 /rc.newwanip: The command '/usr/local/sbin/unbound -c /var/unbound/unbound.conf' returned exit code '1', the output was '[1548807110] unbound[10710:0] error: can't bind socket: Address already in use for 127.0.0.1 port 953 [1548807110] unbound[10710:0] error: cannot open control interface 127.0.0.1 953 [1548807110] unbound[10710:0] fatal error: could not open ports'
    Jan 29 20:11:51 php-fpm 71221 /rc.newwanip: Creating rrd update script
    Jan 29 20:11:52 php-fpm 349 /rc.newwanip: The command '/usr/local/sbin/unbound -c /var/unbound/unbound.conf' returned exit code '1', the output was '[1548807112] unbound[69139:0] error: can't bind socket: Address already in use for 127.0.0.1 port 953 [1548807112] unbound[69139:0] error: cannot open control interface 127.0.0.1 953 [1548807112] unbound[69139:0] fatal error: could not open ports'
    Jan 29 20:11:53 php-fpm 350 /rc.newwanip: Creating rrd update script
    Jan 29 20:11:53 php-fpm 71221 /rc.newwanip: pfSense package system has detected an IP change or dynamic WAN reconnection - 10.41.0.42 -> 10.41.0.126 - Restarting packages.
    Jan 29 20:11:53 check_reload_status Starting packages
    Jan 29 20:11:54 php-fpm 349 /rc.newwanip: Creating rrd update script
    Jan 29 20:11:54 php-fpm 71221 /rc.start_packages: Restarting/Starting all packages.
    Jan 29 20:11:55 php-fpm 350 /rc.newwanip: pfSense package system has detected an IP change or dynamic WAN reconnection - 10.159.0.62 -> 10.159.0.26 - Restarting packages.
    Jan 29 20:11:55 check_reload_status Starting packages
    Jan 29 20:11:56 php-fpm 89674 /rc.start_packages: Restarting/Starting all packages.
    Jan 29 20:11:56 php-fpm 349 /rc.newwanip: pfSense package system has detected an IP change or dynamic WAN reconnection - 10.25.0.6 -> 10.25.0.46 - Restarting packages.
    Jan 29 20:11:56 check_reload_status Starting packages
    Jan 29 20:11:57 php-fpm 430 /rc.start_packages: Restarting/Starting all packages.

    Gateways
    Jan 29 20:10:35 dpinger XPLORNET1_DHCP 192.168.209.1: Alarm latency 238us stddev 43us loss 21%
    Jan 29 20:11:43 dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 192.168.209.1 bind_addr 192.168.209.13 identifier "XPLORNET1_DHCP "
    Jan 29 20:11:43 dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 10.25.0.5 bind_addr 10.25.0.6 identifier "UKVPN_DHCP "
    Jan 29 20:11:43 dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 10.41.0.125 bind_addr 10.41.0.126 identifier "USVPN_DHCP "
    Jan 29 20:11:43 dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 10.159.0.61 bind_addr 10.159.0.62 identifier "CANVPN_VPNV4 "
    Jan 29 20:11:45 dpinger UKVPN_DHCP 10.25.0.5: Alarm latency 0us stddev 0us loss 100%
    Jan 29 20:11:45 dpinger USVPN_DHCP 10.41.0.125: Alarm latency 0us stddev 0us loss 100%
    Jan 29 20:11:45 dpinger CANVPN_VPNV4 10.159.0.61: Alarm latency 0us stddev 0us loss 100%
    Jan 29 20:11:46 dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 192.168.209.1 bind_addr 192.168.209.13 identifier "XPLORNET1_DHCP "
    Jan 29 20:11:46 dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 10.25.0.45 bind_addr 10.25.0.46 identifier "UKVPN_DHCP "
    Jan 29 20:11:46 dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 10.41.0.125 bind_addr 10.41.0.126 identifier "USVPN_DHCP "
    Jan 29 20:11:46 dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 10.159.0.25 bind_addr 10.159.0.26 identifier "CANVPN_VPNV4 "
    Jan 29 20:11:48 dpinger UKVPN_DHCP 10.25.0.45: Alarm latency 0us stddev 0us loss 100%
    Jan 29 20:11:48 dpinger USVPN_DHCP 10.41.0.125: Alarm latency 0us stddev 0us loss 100%
    Jan 29 20:11:48 dpinger CANVPN_VPNV4 10.159.0.25: Alarm latency 0us stddev 0us loss 100%
    Jan 29 20:11:48 dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 192.168.209.1 bind_addr 192.168.209.13 identifier "XPLORNET1_DHCP "
    Jan 29 20:11:48 dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 10.25.0.45 bind_addr 10.25.0.46 identifier "UKVPN_DHCP "
    Jan 29 20:11:48 dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 10.41.0.125 bind_addr 10.41.0.126 identifier "USVPN_DHCP "
    Jan 29 20:11:48 dpinger send_interval 500ms loss_interval 2000ms time_period 60000ms report_interval 0ms data_len 0 alert_interval 1000ms latency_alarm 500ms loss_alarm 20% dest_addr 10.159.0.25 bind_addr 10.159.0.26 identifier "CANVPN_VPNV4 "
    Jan 29 20:11:50 dpinger UKVPN_DHCP 10.25.0.45: Alarm latency 0us stddev 0us loss 100%
    Jan 29 20:11:50 dpinger USVPN_DHCP 10.41.0.125: Alarm latency 0us stddev 0us loss 100%
    Jan 29 20:11:50 dpinger CANVPN_VPNV4 10.159.0.25: Alarm latency 0us stddev 0us loss 100%

    DNS Resolver
    Jan 29 20:11:45 unbound 91533:0 info: service stopped (unbound 1.8.1).
    Jan 29 20:11:45 unbound 91533:0 info: server stats for thread 0: 37689 queries, 19124 answers from cache, 18565 recursions, 2334 prefetch, 0 rejected by ip ratelimiting
    Jan 29 20:11:45 unbound 91533:0 info: server stats for thread 0: requestlist max 68 avg 8.15465 exceeded 0 jostled 0
    Jan 29 20:11:45 unbound 91533:0 info: average recursion processing time 9.705527 sec
    Jan 29 20:11:45 unbound 91533:0 info: histogram of recursion processing times
    Jan 29 20:11:45 unbound 91533:0 info: [25%]=0.0827779 median[50%]=0.198446 [75%]=0.859565
    Jan 29 20:11:45 unbound 91533:0 info: lower(secs) upper(secs) recursions
    Jan 29 20:11:45 unbound 91533:0 info: 0.000000 0.000001 1052
    Jan 29 20:11:45 unbound 91533:0 info: 0.000256 0.000512 3
    Jan 29 20:11:45 unbound 91533:0 info: 0.000512 0.001024 83
    Jan 29 20:11:45 unbound 91533:0 info: 0.001024 0.002048 1546
    Jan 29 20:11:45 unbound 91533:0 info: 0.002048 0.004096 140
    Jan 29 20:11:45 unbound 91533:0 info: 0.004096 0.008192 20
    Jan 29 20:11:45 unbound 91533:0 info: 0.008192 0.016384 21
    Jan 29 20:11:45 unbound 91533:0 info: 0.016384 0.032768 44
    Jan 29 20:11:45 unbound 91533:0 info: 0.032768 0.065536 706
    Jan 29 20:11:45 unbound 91533:0 info: 0.065536 0.131072 3877
    Jan 29 20:11:45 unbound 91533:0 info: 0.131072 0.262144 3459
    Jan 29 20:11:45 unbound 91533:0 info: 0.262144 0.524288 2233
    Jan 29 20:11:45 unbound 91533:0 info: 0.524288 1.000000 1023
    Jan 29 20:11:45 unbound 91533:0 info: 1.000000 2.000000 430
    Jan 29 20:11:45 unbound 91533:0 info: 2.000000 4.000000 408
    Jan 29 20:11:45 unbound 91533:0 info: 4.000000 8.000000 536
    Jan 29 20:11:45 unbound 91533:0 info: 8.000000 16.000000 612
    Jan 29 20:11:45 unbound 91533:0 info: 16.000000 32.000000 791
    Jan 29 20:11:45 unbound 91533:0 info: 32.000000 64.000000 785
    Jan 29 20:11:45 unbound 91533:0 info: 64.000000 128.000000 464
    Jan 29 20:11:45 unbound 91533:0 info: 128.000000 256.000000 203
    Jan 29 20:11:45 unbound 91533:0 info: 256.000000 512.000000 104
    Jan 29 20:11:45 unbound 91533:0 info: server stats for thread 1: 0 queries, 0 answers from cache, 0 recursions, 0 prefetch, 0 rejected by ip ratelimiting
    Jan 29 20:11:45 unbound 91533:0 info: server stats for thread 1: requestlist max 0 avg 0 exceeded 0 jostled 0
    Jan 29 20:11:45 unbound 91533:0 info: server stats for thread 2: 0 queries, 0 answers from cache, 0 recursions, 0 prefetch, 0 rejected by ip ratelimiting
    Jan 29 20:11:45 unbound 91533:0 info: server stats for thread 2: requestlist max 0 avg 0 exceeded 0 jostled 0
    Jan 29 20:11:45 unbound 91533:0 info: server stats for thread 3: 0 queries, 0 answers from cache, 0 recursions, 0 prefetch, 0 rejected by ip ratelimiting
    Jan 29 20:11:45 unbound 91533:0 info: server stats for thread 3: requestlist max 0 avg 0 exceeded 0 jostled 0
    Jan 29 20:11:50 unbound 11006:0 notice: init module 0: validator
    Jan 29 20:11:50 unbound 11006:0 notice: init module 1: iterator
    Jan 29 20:11:50 unbound 11006:0 info: start of service (unbound 1.8.1).
    Jan 29 20:11:52 unbound 11006:0 info: generate keytag query _ta-4f66. NULL IN

    Does this plus my previous post of my Unbound settings help any?

    Thanks,
    Peter



  • @plawlor said in Unbound stops resolving - only a complete Firewall reboot will fix it:

    Jan 29 20:11:50 php-fpm 350 /rc.newwanip: The command '/usr/local/sbin/unbound -c /var/unbound/unbound.conf' returned exit code '1', the output was '[1548807110] unbound[10710:0] error: can't bind socket: Address already in use for 127.0.0.1 port 953 [1548807110] unbound[10710:0] error: cannot open control interface 127.0.0.1 953 [1548807110] unbound[10710:0] fatal error: could not open ports'
    Jan 29 20:11:51 php-fpm 71221 /rc.newwanip: Creating rrd update script
    Jan 29 20:11:52 php-fpm 349 /rc.newwanip: The command '/usr/local/sbin/unbound -c /var/unbound/unbound.conf' returned exit code '1', the output was '[1548807112] unbound[69139:0] error: can't bind socket: Address already in use for 127.0.0.1 port 953 [1548807112] unbound[69139:0] error: cannot open control interface 127.0.0.1 953 [1548807112] unbound[69139:0] fatal error: could not open ports'
    

    Those failures are normal when WAN toggle. It will start a few times before the WAN settle. In the end it looks like unbound started fine. Check again at next failure.



  • I have the same problem!



  • @plawlor : I had a look at your logs.

    For every IP (WAN ? VPN ?) that comes UP, rc.newwanip is called.
    rc.newwanip will call services_unbound_configure() that checks if unbound is set to run (unbound is activated) and then if will terminate unbound, to have it started right away.

    You logs shows that this happen a first time, right after
    @plawlor said in Unbound stops resolving - only a complete Firewall reboot will fix it:

    Jan 29 20:11:43 php-fpm 71221 /rc.newwanip: rc.newwanip: Info: starting on ovpnc2

    On 30:11:45 unbound restarts.

    As you can see, you have 3 "/rc.newwanip: rc.newwanip: Info: starting on xxxx" calls.
    One for ovpn2, 3 and - all within a second :43, :44 and :45
    This probably goes to fast.
    When unbound is killed, the code waits until PID file has been removed, and an extra second ("sleep(1)") is used, and then unbound is restart again.
    The issue is - again : probably - that the same Kill and restart sequence comes right back again ... I guess the 'kill' goes unnoticed, because the PID file doesn't exist yet - unbound is just about to wake up. The sleep(1) is executed .... and unbound is started and buooooam : it's done creating it's process is now running.

    If you are capable of editing a file, I have a test for you ... a simple extra delay.

    In the file /etc/inc/services.inc, locate

    function services_unbound_configure($restart_dhcp = true) {
    

    From this point, search

    sync_unbound_service();
    

    Right before that point, add :

    		if ( !platform_booting()) {
    			sleep(1);
    			log_error("DNS : Info : Sleeping 1 sec after start.");
    		}
    

Log in to reply