Intermittent DNS problems



  • Hi!

    Sometimes, maybe a couple of times a month, I get DNS problems. Suddenly I cannot reach any sites at all, except by IP address. Then I need to uncheck "Do not use the DNS Forwarder or Resolver as a DNS server for the firewall" in General Setup, and it works again. Does anyone know why this is happening? Is there a log I should attach here?

    Would appreciate any help  :)

    Regards

    Tommy


  • LAYER 8 Global Moderator

    What is pfSense pointing to if you're not using itself?

    If you're pointing pfSense to itself using the forwarder or resolver and it cannot resolve, then either that service has crashed or locked up, or it cannot query wherever you're forwarding to (or, for the resolver, cannot actually resolve).

    Are you using the resolver, which has been the default for quite some time, or the forwarder?
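    The first question, what pfSense itself is pointing at, can be answered from the firewall's resolver configuration. Below is a minimal Python sketch. It assumes (from our understanding of pfSense, not confirmed in this thread) that with "Do not use the DNS Forwarder or Resolver as a DNS server for the firewall" unchecked, the firewall lists a loopback address such as 127.0.0.1 in /etc/resolv.conf, and lists only upstream servers when it is checked. The function names are hypothetical.

```python
from __future__ import annotations

import ipaddress


def nameservers(resolv_conf_text: str) -> list[str]:
    """Return the nameserver addresses from resolv.conf-style text, in order."""
    servers = []
    for line in resolv_conf_text.splitlines():
        # resolv.conf comments start with '#' or ';'
        line = line.split("#", 1)[0].split(";", 1)[0].strip()
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def uses_local_resolver(resolv_conf_text: str) -> bool:
    """True if a loopback address (127.0.0.1, ::1, ...) is listed as a nameserver."""
    for server in nameservers(resolv_conf_text):
        try:
            if ipaddress.ip_address(server).is_loopback:
                return True
        except ValueError:
            continue  # skip anything that is not a plain IP address
    return False
```

    On the firewall, `uses_local_resolver(open("/etc/resolv.conf").read())` returning False would suggest the box is bypassing its own Unbound/dnsmasq instance.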



  • I've been observing similar behavior recently. I use the Unbound resolver in the default non-forwarding mode. The symptom is that large numbers of domains become unresolvable because the resolver hangs. I have partly tracked the problem to the connections the resolver makes to the root name servers and/or the name servers for the .com/.org top-level domains: those connections just hang and never return. Usually the problem resolves itself in 10-20 minutes, but I've usually worked around it by switching to forwarding mode with Google's DNS servers as forwarders.


  • LAYER 8 Global Moderator

    Did you try doing queries directly to the root servers? It seems like you might have a network issue upstream if you cannot connect to the root servers.
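    A direct query to a root server is easy to script if drill or dig isn't handy. Below is a minimal Python sketch that builds a one-question DNS query with the RD (recursion desired) bit off, since root and TLD servers only hand out referrals, and decodes the reply header. The function names are hypothetical; the wire format follows RFC 1035.

```python
from __future__ import annotations

import secrets
import socket
import struct

QTYPE_NS = 2
QCLASS_IN = 1


def encode_name(name: str) -> bytes:
    """Encode a domain name as DNS wire-format labels, e.g. 'org.' -> b'\\x03org\\x00'."""
    out = b""
    for label in name.rstrip(".").split("."):
        if label:  # the root name "." has no labels
            raw = label.encode("ascii")
            out += bytes([len(raw)]) + raw
    return out + b"\x00"


def build_query(name: str, qtype: int = QTYPE_NS,
                recursion_desired: bool = False,
                query_id: int | None = None) -> bytes:
    """Build a one-question DNS query (RD off by default for authoritative servers)."""
    if query_id is None:
        query_id = secrets.randbelow(0x10000)
    flags = 0x0100 if recursion_desired else 0x0000
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 0)
    return header + encode_name(name) + struct.pack("!HH", qtype, QCLASS_IN)


def parse_header(message: bytes) -> dict:
    """Decode the fixed 12-byte DNS header of a reply."""
    qid, flags, _qd, ancount, nscount, arcount = struct.unpack("!HHHHHH", message[:12])
    return {
        "id": qid,
        "is_response": bool(flags & 0x8000),
        "rcode": flags & 0x000F,
        "answers": ancount,
        "authority": nscount,
        "additional": arcount,
    }


def ask(server: str, name: str, qtype: int = QTYPE_NS, timeout: float = 3.0) -> dict | None:
    """Send one UDP query to server:53; return the reply header, or None if nothing usable came back."""
    query = build_query(name, qtype)
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.sendto(query, (server, 53))
            data, _ = sock.recvfrom(4096)
    except OSError:  # includes socket timeouts
        return None
    if len(data) < 12:
        return None
    header = parse_header(data)
    if header["id"] != struct.unpack("!H", query[:2])[0]:
        return None  # not a reply to our query
    return header
```

    For example, `ask("198.41.0.4", "org.")` queries a.root-servers.net. A healthy referral should come back with rcode 0 and a non-empty authority section, while `None` means nothing arrived before the timeout.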



  • I'm sorry I can't contribute much on this; I'm not that good with Linux/pfSense, I'm afraid :)

    I don't know if it starts working again after a while, because I never waited very long. But I remember unchecking "Do not use the DNS Forwarder or Resolver as a DNS server for the firewall" last time, and I never checked it again. Yet when the problem came back yesterday, it was checked again, and I had to uncheck it. Does it check itself after a restart? It shouldn't do that, should it?


  • LAYER 8 Global Moderator

    No, it doesn't check itself again after a restart…

    Well, how are we going to find the root of the problem if you cannot do a simple query to a name server? How do you think it's the root servers if you cannot even do a query? And how do you know you're having problems with the .org and .com TLD servers? There are LOTS of them, and they are different:

    ;; QUESTION SECTION:
    ;org.                          IN      NS

    ;; ANSWER SECTION:
    org.                    86400  IN      NS      a0.org.afilias-nst.info.
    org.                    86400  IN      NS      a2.org.afilias-nst.info.
    org.                    86400  IN      NS      b0.org.afilias-nst.org.
    org.                    86400  IN      NS      b2.org.afilias-nst.org.
    org.                    86400  IN      NS      c0.org.afilias-nst.info.
    org.                    86400  IN      NS      d0.org.afilias-nst.org.

    ;; QUESTION SECTION:
    ;com.                          IN      NS

    ;; ANSWER SECTION:
    com.                    172800  IN      NS      c.gtld-servers.net.
    com.                    172800  IN      NS      h.gtld-servers.net.
    com.                    172800  IN      NS      f.gtld-servers.net.
    com.                    172800  IN      NS      m.gtld-servers.net.
    com.                    172800  IN      NS      g.gtld-servers.net.
    com.                    172800  IN      NS      k.gtld-servers.net.
    com.                    172800  IN      NS      b.gtld-servers.net.
    com.                    172800  IN      NS      j.gtld-servers.net.
    com.                    172800  IN      NS      a.gtld-servers.net.
    com.                    172800  IN      NS      i.gtld-servers.net.
    com.                    172800  IN      NS      e.gtld-servers.net.
    com.                    172800  IN      NS      l.gtld-servers.net.
    com.                    172800  IN      NS      d.gtld-servers.net.

    What is a specific FQDN you're having a problem resolving?
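    To make the point about many independent TLD servers concrete, here is a small Python sketch that groups NS records from dig-style answer lines by zone; fed the output above, it yields 6 servers for org. and 13 for com. It assumes dig's five-column `owner TTL class type target` layout, and the function name is hypothetical.

```python
from __future__ import annotations

from collections import defaultdict


def ns_by_zone(dig_output: str) -> dict[str, list[str]]:
    """Group NS targets by owner zone from dig-style answer lines."""
    zones: dict[str, list[str]] = defaultdict(list)
    for line in dig_output.splitlines():
        fields = line.split()
        # Skip blank lines, ';;' comments and ';name' question lines.
        if len(fields) != 5 or fields[0].startswith(";"):
            continue
        owner, ttl, rclass, rtype, target = fields
        if rclass == "IN" and rtype == "NS" and ttl.isdigit():
            zones[owner].append(target)
    return dict(zones)
```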



  • In my case I did verify the connectivity problem by querying the root servers and the .com/.org name servers directly. The problem hasn't reappeared for a few days now, so I can't offer more information at the moment.


  • LAYER 8 Global Moderator

    Are you using Snort, or the other one, Suricata? There seem to have been some issues with that one blocking access to the root servers.



  • No, I have just a basic firewall set up without any packages installed.


  • LAYER 8 Global Moderator

    Well, if some networking issue prevented you from talking to the name servers for those TLDs, then yes, you would have a very hard time resolving any domain in those TLDs. pfSense needs to ask those name servers which servers are authoritative for whatever domain you're looking up in those TLDs, and then go ask those name servers.

    I am a huge fan of actual resolving vs forwarding, but if you have bad connectivity it can be problematic, since you need to be able to query pretty much anywhere on the planet that someone runs the name servers for the domain you're looking up. You might not have any issues reaching where a specific host is set up, but if that domain has crappy DNS you could still have problems.
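    The walk described above (root servers, then the TLD's servers, then the domain's own servers) can be listed as the chain of zones a non-forwarding resolver may need to reach for a given name. Below is a simplified Python sketch. It assumes every parent suffix of the name is its own delegated zone, which real DNS does not guarantee, and the function name is hypothetical.

```python
from __future__ import annotations


def delegation_chain(fqdn: str) -> list[str]:
    """Zones whose servers a non-forwarding resolver may query, root first.

    Simplification: treats every parent suffix as a separate zone cut,
    so the real chain can be shorter.
    """
    labels = [label for label in fqdn.rstrip(".").split(".") if label]
    chain = ["."]
    # Walk from the TLD down to the parent of the queried name.
    for i in range(len(labels) - 1, 0, -1):
        chain.append(".".join(labels[i:]) + ".")
    return chain
```

    For forum.pfsense.org this gives `[".", "org.", "pfsense.org."]`: if the servers for any one of those zones are unreachable from your WAN, the lookup fails even though the web server itself is reachable.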



  • Happened again today. Resolution of all domains stopped, and I could not query any of the root servers using drill @server. Mtr traces and pings to the root servers worked, but no queries were answered. I have no good ideas, only a few guesses. The first is that some kind of rate limiting is going on at the root servers, and if you're a second-class net citizen (on a dynamic IP address, in other words) you could be rate limited when there's enough load on the root servers. The second is that my ISP does some maintenance now and then and somehow blocks DNS queries going outside a set of known addresses during those maintenance windows. Such maintenance would be undetectable for 99.99% of their customers, who happily use the ISP's forwarders or Google's.
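    One further check could help tell those guesses apart from plain UDP filtering (this is our suggestion, not something tried in the thread). Ping, and mtr in its default mode, typically only show that ICMP gets through, so it is worth sending the same DNS query over both UDP and TCP port 53. DNS over TCP prefixes each message with a 2-byte length (RFC 1035, section 4.2.2). Below is a minimal Python sketch. The hard-coded query asks for the NS records of the root zone with RD off, which any root server should answer, and all names are hypothetical.

```python
from __future__ import annotations

import socket
import struct

# ". IN NS" query, ID 0x2a2a, all flags clear (RD off): a root server should answer it.
ROOT_NS_QUERY = (struct.pack("!HHHHHH", 0x2A2A, 0, 1, 0, 0, 0)
                 + b"\x00" + struct.pack("!HH", 2, 1))


def frame_tcp(message: bytes) -> bytes:
    """Prefix a DNS message with its 2-byte big-endian length, as DNS over TCP requires."""
    return struct.pack("!H", len(message)) + message


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes from a stream socket."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("connection closed before full reply")
        buf += chunk
    return buf


def answered_udp(server: str, message: bytes = ROOT_NS_QUERY, timeout: float = 3.0) -> bool:
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.sendto(message, (server, 53))
            data, _ = sock.recvfrom(4096)
    except OSError:
        return False
    return data[:2] == message[:2]  # reply carries our query ID


def answered_tcp(server: str, message: bytes = ROOT_NS_QUERY, timeout: float = 3.0) -> bool:
    try:
        with socket.create_connection((server, 53), timeout=timeout) as sock:
            sock.sendall(frame_tcp(message))
            (length,) = struct.unpack("!H", recv_exact(sock, 2))
            data = recv_exact(sock, length)
    except OSError:  # timeouts and ConnectionError are OSError subclasses
        return False
    return data[:2] == message[:2]


def diagnose(udp_ok: bool, tcp_ok: bool) -> str:
    if udp_ok:
        return "UDP/53 answers"
    if tcp_ok:
        return "only TCP/53 answers: something on the path may be filtering UDP/53"
    return "no answer on UDP or TCP port 53"
```

    Usage, against a.root-servers.net: `print(diagnose(answered_udp("198.41.0.4"), answered_tcp("198.41.0.4")))`.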


  • LAYER 8 Global Moderator

    Both of those guesses seem unlikely to me… I have never heard of the root servers rate limiting, and how would that even work across all of them? All on different networks, all at the same time?

    Did you try to query something other than the root servers, like the NS servers for some specific domain, or a known resolver like Google DNS, OpenDNS, or 4.2.2.2?
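    That suggestion can be scripted as a quick survey. Below is a minimal Python sketch. The server list is a hypothetical choice: a.root-servers.net (198.41.0.4), Google DNS, OpenDNS (208.67.222.222), and 4.2.2.2. The verdict strings are only rough hints. It sends ". IN NS" with RD set, which root servers answer authoritatively and recursive resolvers answer normally.

```python
from __future__ import annotations

import socket
import struct

# ". IN NS" with the RD bit set (flags 0x0100), query ID 0x4242.
QUERY = struct.pack("!HHHHHH", 0x4242, 0x0100, 1, 0, 0, 0) + b"\x00" + struct.pack("!HH", 2, 1)

# Hypothetical test set: label -> (IPv4 address, kind)
SERVERS = {
    "a.root-servers.net": ("198.41.0.4", "root"),
    "Google DNS": ("8.8.8.8", "recursive"),
    "OpenDNS": ("208.67.222.222", "recursive"),
    "4.2.2.2": ("4.2.2.2", "recursive"),
}


def answers(ip: str, timeout: float = 2.0) -> bool:
    """True if the server sends back a reply carrying our query ID."""
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.settimeout(timeout)
            sock.sendto(QUERY, (ip, 53))
            data, _ = sock.recvfrom(4096)
    except OSError:
        return False
    return len(data) >= 12 and data[:2] == QUERY[:2]


def summarize(results: dict[str, tuple[str, bool]]) -> str:
    """results maps a label to (kind, answered)."""
    roots = [ok for kind, ok in results.values() if kind == "root"]
    recursive = [ok for kind, ok in results.values() if kind == "recursive"]
    if not any(roots) and not any(recursive):
        return "nothing answers: DNS traffic is not getting out (or back) at all"
    if any(recursive) and not any(roots):
        return "public resolvers answer but roots do not: look at the path to the root servers"
    if all(roots) and all(recursive):
        return "everything answers: look at the local resolver instead"
    return "mixed results: compare server by server"
```

    Usage: `results = {name: (kind, answers(ip)) for name, (ip, kind) in SERVERS.items()}`, then `print(summarize(results))`.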



  • I'd also like to share some input, as I am experiencing the exact same problem. I believe it started a little over a week ago. I don't recall making any specific changes on my firewall that would have caused this. The only change was an upgrade of my internet speed, which required a new modem (a different brand) due to the addition of a telephone line. I noticed these problems after a day or two and swapped the modem for another one (still the new brand). After troubleshooting without making any progress (the issue was intermittent and very difficult to narrow down), I decided to try the 2.3 RC and restore from a backup.

    After the 2.3 RC clean install, DNS was still failing. This was because unbound wasn't starting: pfBlockerNG wasn't installed, but its configuration line "server:include: /var/unbound/pfb_dnsbl.conf" was still in the Unbound settings. I removed it, and unbound was able to start and resolve DNS. Not long after, the same DNS problems started again. Resolution seemed to time out and fail (if I bypassed pfSense manually, I could query the remote DNS servers fine; I tested with 8.8.8.8, 4.2.2.1-6, and a couple of the a-f.root-servers.net IPs). After troubleshooting for 2 days, I thought something during the restore might have caused the issue to reappear, so I decided to wipe the firewall again, stick with version 2.2.6 to rule out any potential bugs, and then simply restore my VLANs, interfaces, aliases, and firewall rules. I would then reinstall packages manually, one at a time, to see if the problem reappeared.
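    A leftover include like the pfBlockerNG line above is easy to miss. Below is a minimal Python sketch that flags `include:` lines in a block of Unbound custom options whose target file does not exist. The regex only covers the simple forms seen here (optionally prefixed by `server:`, optionally quoted), and the function name is hypothetical.

```python
from __future__ import annotations

import os
import re

# Matches "include: /path", "server:include: /path" and quoted paths, one per line.
INCLUDE_RE = re.compile(r'^\s*(?:server:\s*)?include:\s*"?([^"\s]+)"?', re.MULTILINE)


def missing_includes(custom_options: str, exists=os.path.exists) -> list[str]:
    """Return include targets that do not exist (exists() is injectable for testing)."""
    return [path for path in INCLUDE_RE.findall(custom_options) if not exists(path)]
```

    Run on the firewall against the custom options text, `missing_includes("server:include: /var/unbound/pfb_dnsbl.conf")` would list that path once pfBlockerNG and its file are gone.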

    Once I had restored the VLANs, interfaces, aliases, and firewall rules, I started installing packages (pfBlockerNG, Snort, Squid, Service Watchdog), and at this point everything was running fine. DNS resolution was quick and the proxy was working great (I was kind of surprised at this point, to be honest). There was one specific step that I remember caused the DNS problems to reappear: adding a new WAN interface. I have a cable modem that is basically bridged to one physical NIC and multiple vNICs, giving me multiple public IP addresses. Two of the WAN links had an active IP address assigned, but once I renewed the IP address on the third link and rebooted the firewall, all the DNS problems started happening again. When I saw this I was kind of relieved, because there were so many other settings I needed restored (mainly several certificates for client-site VPNs). Knowing the problem wasn't really specific to my settings, I wiped the firewall again to 2.2.6 and performed a full restore.

    At this point DNS is still hit or miss, and an odd thing I'm noticing is that the firewall itself will not resolve DNS using Unbound as either a resolver or a forwarder. If I disable Unbound and apply the changes, the firewall resolves DNS for package installs, system update status, the RSS feed, etc., but with Unbound enabled it does not work.

    As for clients, resolution is still hit or miss on occasion. Sometimes it acts as though it's going through a proxy with a corrupt cache: pages load oddly or stripped down, with a weird layout or missing pictures (most likely DNS failing to resolve parts of the site while loading). I see this behavior even though Squid isn't installed; typically a refresh fixes the problem.

    If there are any additional steps I can take to help narrow down the problem, please let me know. The logs don't really provide much info regarding the failed DNS lookups.

    Server Specs
    Supermicro SYS-5018A-FTN4
    Intel Atom 2.4GHz
    32 GB RAM
    Samsung 850 PRO 128GB


  • LAYER 8 Global Moderator

    So what outgoing query interfaces do you have selected in Unbound? All of them? Just pick the interface you want Unbound to use for its queries.

    "I could query the remote DNS server fine"

    This is where your problem differs from the OP's: he states he cannot query external DNS servers directly.




  • Actually, I only have about 6 out of 15 internal interfaces selected for inbound queries, and two WAN interfaces selected for outbound.

    I will say I believe I found my problem, though… I disabled my 3rd WAN interface, applied the settings in pfSense, rebooted the modem (since it needs to see the updated MAC addresses connected), and then finally rebooted pfSense again. Afterwards everything was running normally. I haven't had this issue before, but this is a really odd procedure to get the modem, ESXi, and pfSense to properly register the WAN interfaces. The problem seems to occur only when I have more than 2 WAN interfaces active. I've had this working in the past, but as soon as I set up a 3rd interface it starts having problems routing traffic. While I was having the problem, I noticed a lot of TCP retransmissions in a few packet captures I took. I'm not sure why it's getting choked up, but at least I found the issue. Does anyone else have more than 2 WAN links active using DHCP? Keep in mind that the IP address leases provided by DHCP do not share the same gateway address.


  • LAYER 8 Global Moderator

    What does your routing look like with 3 WANs? Are you creating WAN groups? Are they all set as default?

    Why do you have 3 WANs? Are all 3 public addresses from your ISP in different networks? Or are they all in the same RFC 1918 space?



  • Just wanted to chime in… it happened again today. I could not reach any sites. I checked and unchecked "Do not use the DNS Forwarder or Resolver as a DNS server for the firewall", and it's working fine again. Strange.  :P

    Edit: Looks like I have the same problems as the people in this thread: https://forum.pfsense.org/index.php?topic=103714.0

    I get the message "kernel: pid 40874 (dnsmasq), uid 65534: exited on signal 11" in system log when the internet goes down.

    Seems like an update to 2.3.0 will fix the issue :)
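    Before toggling the checkbox next time, it may be worth searching the system log for the crash line quoted above. Below is a minimal Python sketch that extracts DNS-daemon crash records in the `pid N (name), uid N: exited on signal N` form from log text. The daemon list and function name are our assumptions. Signal 11 is SIGSEGV, a segmentation fault.

```python
from __future__ import annotations

import re

CRASH_RE = re.compile(r"pid (\d+) \(([\w.-]+)\), uid (\d+): exited on signal (\d+)")


def dns_daemon_crashes(log_text: str, daemons=("dnsmasq", "unbound")) -> list[tuple[int, str, int]]:
    """Return (pid, daemon, signal) for each crash line of a DNS daemon."""
    hits = []
    for match in CRASH_RE.finditer(log_text):
        pid, name, _uid, sig = match.groups()
        if name in daemons:
            hits.append((int(pid), name, int(sig)))
    return hits
```

    On the line from this thread it returns `[(40874, "dnsmasq", 11)]`.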

