Unbound failure after power failure - how to prevent? [solved enough]



  • This morning when I came into work, I found a complete network-wide DNS outage. We are set up with pfSense providing DNS Resolver (Unbound) for the network.

    Looking through my cameras/logs/etc., it seems we had a power outage which exhausted the UPS system for the router. Unfortunately, I do not yet have any UPS monitoring properly set up for the router. As a result, the system shut down hard.

    When everything came back up, unbound did not. When I arrived (hours later), I could not manually start unbound (no issues in logs, just an attempt to start and then it wasn't started). I immediately switched over to the Forwarder, then changed the listen port and log level on Unbound to debug. At this point unbound started properly. I was then able to switch back to using unbound as normal.

    My theory, based on similar posts in the past, is that the system was shut down in the middle of a file write related to unbound, and that resetting the config was able to overcome this.

    As a stopgap, I added Google DNS to the DHCP options for my network, so that we can bypass the router if an issue like this arises. However, for local resolving to continue working, I'd like to make sure this issue doesn't happen again. Short of properly setting up the UPS to shut down the system (which is in the works), is there a way to have the config automatically repair itself if unbound fails to start?


  • LAYER 8 Global Moderator

    @mkernalcon said in Unbound failure after power failure - how to prevent?:

    I added Google DNS to the DHCP options for my network

    It doesn't work that way. If you hand clients more than one DNS server, you have no idea which one they will use, so you could start seeing clients unable to resolve local resources even when your local DNS is working fine.



  • @johnpoz said in Unbound failure after power failure - how to prevent?:

    It doesn't work that way. If you hand clients more than one DNS server, you have no idea which one they will use, so you could start seeing clients unable to resolve local resources even when your local DNS is working fine.

    My thought was that even if they check and fail to resolve on public DNS, they'd go through the rest of their list until one resolves. Is this not accurate? (And theoretically they should try the first server first, but I know that's not a guarantee)

    Obviously this isn't my favorite solution, so if I can solve the unbound problem, I will revert to exclusively using the router as intended.



  • @mkernalcon said in Unbound failure after power failure - how to prevent?:

    it seems we had a power outage which exhausted the UPS system for the router

    The primary purpose of a UPS is to trigger an automated, controlled shutdown of all logically connected devices when the power goes down and stays down for more than X minutes.
    The UPS will 'bridge' any very short power outages by using its battery. That's a nice advantage, but not the main goal of a UPS.

    A UPS should "communicate by any means" with the critical devices it delivers power to.

    What happened is that, UPS or not, you had a power outage. This can cause** a file system failure.
    When the device starts up again, you must check the file system. (The pfSense manual and this forum often discuss this procedure.)

    Also: this "UPS + devices" setup should be tested about once a month. Just pull the mains power plug and see what happens. The protected devices should proceed with an ordinary shutdown after the X minutes. If they don't, review and retest your setup.
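
    The "shut down after X minutes on battery" rule can be sketched as a small decision function. This is only an illustration under stated assumptions: the function name, the 5-minute limit, and the 30% charge floor are hypothetical values, and in practice this job is normally handled by UPS monitoring software (for example NUT or apcupsd) rather than by hand-written code.

```python
# Hypothetical sketch of the decision a UPS monitor makes; all thresholds are made up.

def should_shut_down(seconds_on_battery, battery_percent,
                     max_seconds_on_battery=300, min_battery_percent=30):
    """Return True when a controlled shutdown should start.

    seconds_on_battery is 0 while mains power is present.
    """
    if seconds_on_battery <= 0:
        return False  # on mains power: nothing to do
    if seconds_on_battery >= max_seconds_on_battery:
        return True   # the outage outlasted the allowed X minutes
    # Still within X minutes, but shut down early if the battery is nearly empty.
    return battery_percent <= min_battery_percent


print(should_shut_down(20, 95))   # False: short blip, the battery bridges it
print(should_shut_down(400, 80))  # True: power stayed down too long
print(should_shut_down(60, 25))   # True: battery nearly exhausted
```

    The monthly pull-the-plug test is what confirms that the real monitoring software reaches the "True" branch before the battery runs out.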

    So, the solution: I advise you to finish your UPS setup. There is a little more involved than simply using the UPS as a power strip ^^

    ** Try it out for yourself: take an ordinary Windows PC, rip out the wall power plug, and restart the PC.
    I'll bet you have major boot problems within 10 tries.
    So, please, just believe me, and do not try this @home.



  • @mkernalcon said in Unbound failure after power failure - how to prevent?:

    @johnpoz said in Unbound failure after power failure - how to prevent?:

    It doesn't work that way. If you hand clients more than one DNS server, you have no idea which one they will use, so you could start seeing clients unable to resolve local resources even when your local DNS is working fine.

    My thought was that even if they check and fail to resolve on public DNS, they'd go through the rest of their list until one resolves. Is this not accurate? (And theoretically they should try the first server first, but I know that's not a guarantee)

    No, that's not correct. The clients would only try an alternate DNS server if the first one they attempted to contact did not reply at all. If the DNS server they ask first returns an "NXDOMAIN" response (indicating the requested domain name does not exist), the client will stop asking any other servers, since it got a response, and that response was "there is no such domain on the Internet by that name".

    In the case of a local domain defined only for your internal LAN, Google's DNS and every other public DNS server on the Internet will have no record of that domain, so they will return an NXDOMAIN response when your clients ask them for the IP address of an internal host.
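
    The behaviour described above can be sketched as a small simulation. This is only a toy model, not how any particular operating system's resolver is implemented: the server functions, the host name nas.home.lan, and the address 192.168.1.10 are all made up for illustration, and real clients differ in which server they try first.

```python
# Toy model of a client that was handed two DNS servers.
# All names and addresses here are hypothetical.

LOCAL_RECORDS = {"nas.home.lan": "192.168.1.10"}


def local_dns(name):
    """The router's resolver: knows the internal hosts."""
    if name in LOCAL_RECORDS:
        return ("NOERROR", LOCAL_RECORDS[name])
    return ("NXDOMAIN", None)


def public_dns(name):
    """A public resolver: has no record of the internal domain."""
    return ("NXDOMAIN", None)


def dead_dns(name):
    """A server that never replies (for example, unbound not running)."""
    raise TimeoutError("no reply")


def resolve(name, servers):
    """Ask servers in order; move on only when a server does not reply."""
    for server in servers:
        try:
            return server(name)  # any reply, including NXDOMAIN, is final
        except TimeoutError:
            continue             # silence is the only reason to try the next one
    return ("TIMEOUT", None)


# Public server asked first: it replies NXDOMAIN, so the local server is never asked.
print(resolve("nas.home.lan", [public_dns, local_dns]))  # ('NXDOMAIN', None)

# First server is silent: the client falls through to the local server.
print(resolve("nas.home.lan", [dead_dns, local_dns]))    # ('NOERROR', '192.168.1.10')
```

    The first call is the failure mode for the "add Google DNS as a backup" stopgap: the internal host exists, but the client never learns that because a public server answered first.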



  • Well, so much for that idea.

    I've reverted back to local-only resolving, and I have a better UPS system coming, which I will actually take the time to get set up properly. Thanks for the input all around.



  • @Gertjan said in Unbound failure after power failure - how to prevent? [solved enough]:

    Try it out for yourself: take an ordinary Windows PC, rip out the wall power plug, and restart the PC.
    I'll bet you have major boot problems within 10 tries.
    So, please, just believe me, and do not try this @home.

    Several years ago, I worked at IBM. One day I got a call from someone whose computer wouldn't boot. Her disk was full of garbage. It turned out at the end of the day she'd just turn off the power bar, instead of doing a proper shut down.

