DNS Resolver host overrides don't work SOMETIMES



  • Hi Dear All :)

    I've found an interesting problem. A Zimbra setup behind NAT (pfSense) requires a split-zone DNS setup, i.e. overriding its public hostname (which is also the MX) to Zimbra's local IP.

    Zimbra uses pfSense as its DNS server. Here is what I found when reviewing the mail logs after users complained about ~1h message delivery delays:

    
    Nov  1 16:21:32 mail postfix/lmtp[2471]: connect to mail.mydomain.tld[111.22.33.44]:7025: Connection timed out
    Nov  1 16:21:32 mail postfix/lmtp[2471]: 89A2081F938A: to=<somemail@mydomain.tld>, relay=none, delay=127, delays=0/0/127/0, dsn=4.4.1, status=deferred (connect to mail.mydomain.tld[111.22.33.44]:7025: Connection timed out)
    Nov  1 16:28:48 mail postfix/lmtp[8717]: connect to mail.mydomain.tld[111.22.33.44]:7025: Connection timed out
    Nov  1 16:28:48 mail postfix/lmtp[8718]: connect to mail.mydomain.tld[111.22.33.44]:7025: Connection timed out
    Nov  1 16:28:48 mail postfix/lmtp[8716]: connect to mail.mydomain.tld[111.22.33.44]:7025: Connection timed out
    

    111.22.33.44 - public IP (should be the private one, overridden in pfSense)
    mail.mydomain.tld - Zimbra's hostname

    Any ideas why pfSense sometimes (about once a week) returns the public IP instead of the overridden private IP?

    In theory I can add a few checks to Zabbix that nslookup pfSense every minute and check the reply/status, but I don't know whether that will give me any additional information to solve the issue.
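
A per-minute check like the one described could look roughly like the sketch below. It is only an illustration: the function names are made up for this example, the addresses in the usage note are placeholders, and it assumes `dig` is installed on the monitoring host. It asks one specific server directly, so a wrong answer can be attributed to that server rather than to whatever the system resolver happens to use.

```python
import ipaddress
import subprocess


def parse_dig_short(output: str) -> list[str]:
    """Return the IP addresses found in `dig +short` output."""
    addresses = []
    for line in output.splitlines():
        candidate = line.strip()
        if not candidate:
            continue
        try:
            ipaddress.ip_address(candidate)
        except ValueError:
            continue  # e.g. a CNAME target or a ";; connection timed out" line
        addresses.append(candidate)
    return addresses


def override_holds(addresses: list[str], expected: str) -> bool:
    """True only if at least one address came back and all of them match the override."""
    return bool(addresses) and all(a == expected for a in addresses)


def query_a(server: str, name: str, timeout: int = 5) -> list[str]:
    """Ask one specific DNS server for the A record by shelling out to dig."""
    result = subprocess.run(
        ["dig", f"@{server}", name, "A", "+short"],
        capture_output=True, text=True, timeout=timeout, check=False,
    )
    return parse_dig_short(result.stdout)
```

Usage would be something like `override_holds(query_a("<pfSense LAN IP>", "mail.mydomain.tld"), "<private IP>")`, with the result reported to the monitoring system. An empty answer is treated as a failure on purpose, so timeouts show up as well.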

    Thanks in advance for your help :)

    P.S. pfSense 2.3.2


  • LAYER 8 Global Moderator

    So you're sure you're only asking pfSense.. If you have an override, that is the only thing that would be returned. If you're getting the public IP, my guess is you are asking something else..



  • @johnpoz:

    So you're sure you're only asking pfSense.. If you have an override, that is the only thing that would be returned. If you're getting the public IP, my guess is you are asking something else..

    You are right in theory, but I have strange results in practice.
    I'll add Zabbix checks (every minute local nslookup on Zimbra host + direct nslookup to pfSense) and will be back with results.



  • Issue solved. When Zimbra behind pfSense asks for the MX record of a domain (like domain.tld), the pfSense resolver (Unbound) replies with a single message containing both the MX record (name) and the resolved address of the MX host (IP).

    Unbound does not respect the host override for the MX host here (mail.domain.tld should be 10.1.80.4 in my case):

    [zimbra@mail root]$ dig @10.1.80.1 domain.tld MX
    
    ; <<>> DiG 9.9.4-RedHat-9.9.4-61.el7_5.1 <<>> @10.1.80.1 riko-group.com MX
    ; (1 server found)
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 60845
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 2
    
    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 4096
    ;; QUESTION SECTION:
    ;domain.tld.                        IN      MX
    
    ;; ANSWER SECTION:
    domain.tld.         86023   IN      MX      10 mail.domain.tld.
    
    ;; ADDITIONAL SECTION:
    mail.domain.tld.    86023   IN      A       111.222.33.44
    
    ;; Query time: 0 msec
    ;; SERVER: 10.1.80.1#53(10.1.80.1)
    ;; WHEN: Mon Dec 10 14:47:28 MSK 2018
    ;; MSG SIZE  rcvd: 80
    
    
    [zimbra@mail root]$ dig @10.1.80.1 mail.domain.tld A +short
    10.1.80.4
    

    Everything works while Zimbra's DNS cache holds the A record for the MX name (mail.domain.tld). But if the cache is empty when Zimbra performs an MX lookup, it gets the external IP instead of the internal one, and mail flow stops.
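
To spot this kind of reply automatically, one could scan the ADDITIONAL SECTION of full `dig` output and compare any A records against the configured overrides. The sketch below assumes the usual dig text layout (name, TTL, class, type, data); the function names are invented for this example.

```python
def additional_a_records(dig_output: str) -> dict[str, list[str]]:
    """Collect A records from the ADDITIONAL SECTION of full dig output."""
    records: dict[str, list[str]] = {}
    in_additional = False
    for line in dig_output.splitlines():
        stripped = line.strip()
        if stripped.startswith(";;"):
            # Any ";;" line starts a new section or a footer line.
            in_additional = stripped == ";; ADDITIONAL SECTION:"
            continue
        if not in_additional or not stripped or stripped.startswith(";"):
            continue
        fields = stripped.split()
        # Expected layout: name, TTL, class, type, rdata.
        if len(fields) >= 5 and fields[2] == "IN" and fields[3] == "A":
            records.setdefault(fields[0].lower(), []).append(fields[4])
    return records


def leaked_overrides(dig_output: str, overrides: dict[str, str]) -> dict[str, list[str]]:
    """Return additional-section addresses that contradict a host override."""
    leaks: dict[str, list[str]] = {}
    for name, addresses in additional_a_records(dig_output).items():
        expected = overrides.get(name)
        if expected is None:
            continue  # no override configured for this name
        wrong = [a for a in addresses if a != expected]
        if wrong:
            leaks[name] = wrong
    return leaks
```

Fed the MX reply shown above with `{"mail.domain.tld.": "10.1.80.4"}` as the override map, it reports the external address as a leak. Names are compared in their fully qualified form with the trailing dot, as dig prints them.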


  • LAYER 8 Global Moderator

    This thread is 2 years old... And you want to just pick it up now? Were you in a COMA?



  • This issue is really rare, but it happens. I thought it would be good for anyone affected by the problem to find a solution. I had thought it was a problem with Microsoft's DNS server, but it's pfSense.

    BTW, with the old "DNS Forwarder" there is no such issue: it doesn't try to push the resolved MX IP along with the MX name.

    [zimbra@mail root]$ dig @10.1.80.1 domain.tld MX
    
    ; <<>> DiG 9.9.4-RedHat-9.9.4-61.el7_5.1 <<>> @10.1.80.1 riko-group.com MX
    ; (1 server found)
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 28296
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1
    
    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 1452
    ;; QUESTION SECTION:
    ;domain.tld.                        IN      MX
    
    ;; ANSWER SECTION:
    domain.tld.         10699   IN      MX      10 mail.domain.tld.
    
    ;; Query time: 6 msec
    ;; SERVER: 10.1.80.1#53(10.1.80.1)
    ;; WHEN: Mon Dec 10 15:22:03 MSK 2018
    ;; MSG SIZE  rcvd: 64
    
    [zimbra@mail root]$
    

  • LAYER 8 Global Moderator

    Hmmm.. To be honest, it's not really Unbound's place to return any additional records on an MX query. I do believe it defaults to minimal-responses: yes

    But it should really be the responsibility of the MTA that wants to know where to send mail to do a query for the host named in the MX record returned.

    If you do a query for A mail.domain.tld, do you get back the IP... If so, then Unbound is working how it should.. I would have to reread some RFCs, but it's not really the responsibility of the NS to return additional records.. While it might, the MTA should not assume that just because it asks for MX it will also get back, in the same response, the A or AAAA record for the host.domain.tld listed in the MX..

    https://wiki.zimbra.com/wiki/MTA
    3. Internet to Zimbra

    In order for a remote MTA on the internet to send mail to the Zimbra server, the remote host will look in DNS for MX record(s) for the destination domain (domain.com). After finding out that the MX record for domain.com is zimbra.domain.com, the remote MTA will look for the A record of zimbra.domain.com, so that it can connect to the appropriate server (Zimbra) and deliver the mail. If these entries are not available in public DNS, you probably will not receive mail from remote accounts.

    From the above, your MTA should do another query for the A record of the MX FQDN.. So Unbound not returning it as an additional record is not the reason for your problem.



  • Just tested with Microsoft DNS and BIND: it seems to be standard behavior for DNS servers to return resolved IPs along with MX names in a single reply to an MX-type query.

    On Zimbra's side, it's not Zimbra working in a "bad way"; it's the OS (in my case a caching Unbound, a standard part of Zimbra) that performs all DNS queries for the MTA. And I believe caching these full replies (IPs included along with names) on the client side is reasonable: cache as much as possible to make as few DNS queries as possible.

    Anyway, here is how we configure host overrides in pfSense/Unbound now (A record):

    [2.4.4-RELEASE][admin@gw.domain.tld]/root: cat /var/unbound/host_entries.conf | grep mail
    local-data-ptr: "10.1.80.4 mail.domain.tld"
    local-data: "mail.domain.tld. A 10.1.80.4"
    

    A DNS override is one of the standard/recommended ways of putting Zimbra or another MTA behind a firewall (without a public IP on the MTA itself). With the current Unbound host overrides in pfSense we just can't do it properly with the "DNS Resolver". The old dnsmasq (i.e. the "Forwarder") doesn't have this problem, if only because it's too dumb to return smart replies. I don't know if it's an Unbound bug, but not respecting its own host override in a transparent zone when returning the A part along with an MX reply looks suspicious. And to be honest, I don't know what record we could add to our Unbound config to force it to return the proper IP... we already have the proper IP there (the A record).

    Maybe we should file a bug on Unbound's bug tracker, or someone who is an expert in Unbound could review pfSense's Unbound config (maybe we can solve the issue with some Unbound settings).


  • LAYER 8 Global Moderator

    You do understand that BIND and MS are designed to be AUTHORITATIVE, while unbound is meant as a recursive cache..

    But again, it does not matter.. additional records are not required to be returned... It is up to the MTA to do that additional query..

    Use BIND if you want an authoritative NS.. Or use a domain override pointing to an authoritative NS that returns the additional info..

    If the NS is not authoritative for the same domain the MX resides in, why should you think you will get back the A record.. In such a setup it would not be possible for that NS to return the records for you without doing recursion.. Which authoritative NS normally do not do, etc.

    To be honest, this is WAD (working as designed) if you ask me.. You should look into why your MTA is not doing the query for the FQDN of the MX that was returned, rather than why a recursive NS not designed to be authoritative doesn't return additional records.

    Ask say quad 9 for say gmail.com mx - do you get back the additionals.. NO... Or 1.1.1.1 or 4.2.2.2 etc. etc.. For that matter 8.8.8.8 but if you ask the authoritative NS then you get back the additional.



  • @johnpoz said in DNS Resolver host overrides don't work SOMETIMES:

    You do understand that BIND and MS are designed to be AUTHORITATIVE, while unbound is meant as a recursive cache..

    No difference here. It looks like both caching and authoritative DNS servers can work the same way: they may return IPs along with names in reply to an MX query. And that looks like normal behavior.

    Unbound doc (https://nlnetlabs.nl/documentation/unbound/unbound.conf/):

    minimal-responses: <yes or no>
                  If yes, Unbound  doesn't  insert  authority/additional  sections
                  into  response  messages  when  those sections are not required.
                  This reduces response size  significantly,  and  may  avoid  TCP
                  fallback  for  some responses.  This may cause a slight speedup.
                  The default is yes, even though the DNS  protocol  RFCs  mandate
                  these  sections,  and the additional content could be of use and
                  save roundtrips for clients.  Because they are not used, and the
                  saved  roundtrips are easier saved with prefetch, whilst this is
                  faster.
    

    If I add

    [2.4.4-RELEASE][admin@gw.domain.tld]/root: cat /var/unbound/unbound.conf | grep minimal-res
    minimal-responses: yes
    

    then Unbound STOPS returning additional info for MX queries. Per Unbound's documentation this should be the default, but it looks like it's not.

    Anyway, if I OVERRIDE the A record for a hostname in pfSense's (default) resolver and point all internal clients to pfSense as their DNS server, I expect it (pfSense) to NEVER EVER return the external IP for the overridden host, no matter whether it is returned as the reply to an A query or as the additional part of an MX query. DNS resolvers CAN cache these replies (both the main and the additional part), so don't blame Zimbra here: we are talking about the DNS server <-> resolver relationship, and on the other side we can have any resolver. The Windows resolver works the same way (tested).

    What we have now is that external IPs for overridden hosts sometimes leak through pfSense into the internal network via Unbound's additional records. That may break things (and it does). To solve this, we can just add "minimal-responses: yes" to the default Unbound config in pfSense.
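
Whether the option is set explicitly or left to the built-in default can be checked from the config text itself. The sketch below is a hypothetical helper, not part of pfSense or Unbound; it assumes that `#` starts a comment and that the last occurrence of the option is the one that counts, and it does not follow `include:` directives.

```python
def minimal_responses_setting(conf_text: str):
    """Return 'yes' or 'no' if minimal-responses is set explicitly, else None."""
    value = None
    for line in conf_text.splitlines():
        statement = line.split("#", 1)[0].strip()  # drop trailing comments
        if statement.startswith("minimal-responses:"):
            # Assumption: a later occurrence replaces an earlier one.
            value = statement.split(":", 1)[1].strip().strip('"').lower()
    return value
```

A result of `None` means the running value depends on the default compiled into that Unbound version, which is exactly the point in dispute here.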


  • LAYER 8 Global Moderator

    By default local overrides are transparent and not static..

    I.e. if you put in www.domain.tld and a user looks up other.domain.tld, then it will query for that..

    You are free to change the settings to meet your needs..

    So your problem was leakage of additional records?

    There is no entry by default in the config, so if default is yes... But even when set to no it doesn't return your A record for you.

    You're talking about queries it sends to external servers: if additional records are returned, it will not strip them... That does not mean it will return additional records when given an override.

    Your MTA should not expect to get back additional, and should do the query - but now it sounds like you were getting leakage of additional as your problem?



  • @johnpoz said in DNS Resolver host overrides don't work SOMETIMES:

    You are free to change the settings to meet your needs..
    So your problem was leakage of additional records?

    It's not just "my problem". It's either a bug in Unbound or a bug in pfSense's "host override" implementation.

    One more time: if I override a host in pfSense, I expect it to work in ALL cases.

    Actual behavior: for queries where pfSense (Unbound) returns the overridden host as an additional RR, it sometimes returns the external IP.

    Even if it's an Unbound bug, we should still address this on the pfSense side until it's resolved upstream.

    Anyway - https://redmine.pfsense.org/issues/9189



  • @johnpoz said in DNS Resolver host overrides don't work SOMETIMES:

    Your MTA should not expect to get back additional, and should do the query - but now it sounds like you were getting leakage of additional as your problem?

    Forget about the MTA. The MTA does not do DNS queries itself. It relies either on the local system resolver or on a local caching DNS server.

    Here is the problem in short.

    1. I expect mail.domain.tld to be 10.1.80.4 in my LAN. I create this override on pfSense.
    2. I have a local (caching & authoritative) DNS server (which is also a domain controller) and I point it to pfSense as its DNS forwarder. Let's call it DC1.
    3. All local PCs/servers have DC1 as their DNS server.
    4. I have Zimbra; on Zimbra I use DC1 as the DNS server (as on all other PCs/servers).

    Query: Zimbra -> DC1: give me MX for domain.tld?
    Query: DC1 -> pfSense: give me MX for domain.tld?
    Reply: pfSense -> DC1: MX for domain.tld is mail.domain.tld AND mail.domain.tld IP is 1.2.3.4 (external IP).
    Now DC1's cache is polluted with wrong IP for mail.domain.tld
    Reply: DC1 -> Zimbra: MX for domain.tld is mail.domain.tld AND mail.domain.tld IP is 1.2.3.4
    Now Zimbra's cache is polluted with wrong IP for mail.domain.tld

    If I clear the DNS cache on Zimbra, I still get the wrong IP in reply to a simple A query, because DC1's DNS cache is still polluted with it.

    And this breaks things!
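
The chain described above can be reproduced with a toy model. This is a deliberately simplified sketch: the class, the `leaky_resolver` stand-in, and the rule that a forwarder caches every additional record it sees are assumptions made for illustration, not a description of how any particular DNS server is implemented.

```python
class ForwardingCache:
    """Toy forwarder that caches everything an upstream reply contains."""

    def __init__(self, upstream):
        self.upstream = upstream  # callable: (name, rtype) -> (answer, additional)
        self.cache = {}           # (name, rtype) -> cached value

    def query(self, name, rtype):
        key = (name, rtype)
        if key in self.cache:
            return self.cache[key], {}
        answer, additional = self.upstream(name, rtype)
        self.cache[key] = answer
        # The step that causes the trouble: additional records are cached too.
        for extra_key, extra_value in additional.items():
            self.cache.setdefault(extra_key, extra_value)
        return answer, additional


def leaky_resolver(name, rtype):
    """Stand-in for the resolver behaviour described above (made-up data)."""
    if (name, rtype) == ("domain.tld", "MX"):
        return "mail.domain.tld", {("mail.domain.tld", "A"): "1.2.3.4"}
    if (name, rtype) == ("mail.domain.tld", "A"):
        return "10.1.80.4", {}  # the direct A query honours the override
    raise LookupError(f"no data for {name} {rtype}")
```

With `dc1 = ForwardingCache(leaky_resolver)` and `zimbra = ForwardingCache(dc1.query)`, an MX lookup through `zimbra` leaves 1.2.3.4 cached for mail.domain.tld at both levels, and a later A lookup returns it even after Zimbra's own cache is emptied. If the A lookup happens first, both caches hold 10.1.80.4 instead, which matches the "sometimes" in the thread title.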


  • LAYER 8 Global Moderator

    When I say MTA, I mean the BOX your MTA is running on.. Not the process that will actually do the query..

    But again, it is NOT the responsibility of Unbound to hand back the A record of the MX record.. That is the responsibility of the client asking for the MX, which should also query for the A if it needs it..

    Set your zone to be static if you do not want pfSense to do queries for stuff it has no records or cache of.. This would be done in the same box where you add the MX and A records that you're trying to override..

    Where is the query where you're seeing this come back from pfSense... List the cache records in pfSense that show this external IP.. It's simple enough to view the records in the cache for any specific domain: just do a grep on the output of the dump_cache command. And then show the query to pfSense where it hands back this info..
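
For anyone who would rather script the "grep on dump_cache" step, here is a rough sketch. It assumes `unbound-control` is available and that the resolver config lives at /var/unbound/unbound.conf (the path shown earlier in this thread); it also assumes the dump lists one resource record per line with the owner name first. The function names are made up for this example.

```python
import subprocess


def cached_records(dump_text: str, domain: str) -> list[str]:
    """Return record lines from a dump_cache listing owned by domain or its subdomains."""
    needle = domain.lower().rstrip(".")
    hits = []
    for line in dump_text.splitlines():
        # Skip comments (";rrset ..."), message-cache entries and blank lines.
        if not line.strip() or line.startswith(";") or line.startswith("msg "):
            continue
        owner = line.split()[0].lower().rstrip(".")
        if owner == needle or owner.endswith("." + needle):
            hits.append(line)
    return hits


def dump_cache(config_path: str = "/var/unbound/unbound.conf") -> str:
    """Run `unbound-control dump_cache` and return its text output."""
    result = subprocess.run(
        ["unbound-control", "-c", config_path, "dump_cache"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

`cached_records(dump_cache(), "domain.tld")` would then list whatever the resolver currently holds for that zone, including any external A record for the mail host.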

    As I brought up over 2 years ago... Are you sure this box running the MTA doesn't just list another NS for DNS that it might be asking, and getting this other info that you are wanting to override?

    Here is an example... Unbound out of the box has minimal-responses defaulting to yes.

    I do a query for the MX records of netgate.com to SOA ns of netgate.. I get back additional records..

    E:\>dig @ns1.netgate.com netgate.com mx
    
    ; <<>> DiG 9.12.3 <<>> @ns1.netgate.com netgate.com mx
    ; (1 server found)
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 5860
    ;; flags: qr aa rd; QUERY: 1, ANSWER: 7, AUTHORITY: 2, ADDITIONAL: 5
    ;; WARNING: recursion requested but not available
    
    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 4096
    ;; QUESTION SECTION:
    ;netgate.com.                   IN      MX
    
    ;; ANSWER SECTION:
    netgate.com.            3600    IN      MX      30 aspmx5.googlemail.com.
    netgate.com.            3600    IN      MX      10 aspmx.l.google.com.
    netgate.com.            3600    IN      MX      20 alt2.aspmx.l.google.com.
    netgate.com.            3600    IN      MX      20 alt1.aspmx.l.google.com.
    netgate.com.            3600    IN      MX      30 aspmx2.googlemail.com.
    netgate.com.            3600    IN      MX      30 aspmx4.googlemail.com.
    netgate.com.            3600    IN      MX      30 aspmx3.googlemail.com.
    
    ;; AUTHORITY SECTION:
    netgate.com.            3600    IN      NS      ns2.netgate.com.
    netgate.com.            3600    IN      NS      ns1.netgate.com.
    
    ;; ADDITIONAL SECTION:
    ns1.netgate.com.        3600    IN      A       208.123.73.80
    ns1.netgate.com.        3600    IN      AAAA    2610:160:11:11::80
    ns2.netgate.com.        3600    IN      A       162.208.119.38
    ns2.netgate.com.        3600    IN      AAAA    2610:1c1:3::108
    
    ;; Query time: 37 msec
    ;; SERVER: 208.123.73.80#53(208.123.73.80)
    ;; WHEN: Tue Dec 11 04:15:32 Central Standard Time 2018
    ;; MSG SIZE  rcvd: 340
    
    
    E:\>
    

    If I ask pfsense for the same mx - the additional are not given..

    E:\>dig @192.168.9.253 netgate.com mx
    
    ; <<>> DiG 9.12.3 <<>> @192.168.9.253 netgate.com mx
    ; (1 server found)
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 6624
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 7, AUTHORITY: 0, ADDITIONAL: 1
    
    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 4096
    ;; QUESTION SECTION:
    ;netgate.com.                   IN      MX
    
    ;; ANSWER SECTION:
    netgate.com.            3445    IN      MX      20 alt1.aspmx.l.google.com.
    netgate.com.            3445    IN      MX      10 aspmx.l.google.com.
    netgate.com.            3445    IN      MX      30 aspmx5.googlemail.com.
    netgate.com.            3445    IN      MX      30 aspmx4.googlemail.com.
    netgate.com.            3445    IN      MX      20 alt2.aspmx.l.google.com.
    netgate.com.            3445    IN      MX      30 aspmx3.googlemail.com.
    netgate.com.            3445    IN      MX      30 aspmx2.googlemail.com.
    
    ;; Query time: 0 msec
    ;; SERVER: 192.168.9.253#53(192.168.9.253)
    ;; WHEN: Tue Dec 11 04:20:33 Central Standard Time 2018
    ;; MSG SIZE  rcvd: 216
    
    
    E:\>
    

    So let's see this query or cache from Unbound showing you the wrong info..

    BTW, also notice that while I got back additional info from ns1.netgate.com, it is NOT the A records of the MX hosts!!! That's because it is not authoritative for the names the MX records point to... But if I ask ns1.google.com for gmail.com mx, it does send back the A records.

    E:\>dig @ns1.google.com gmail.com MX
    
    ; <<>> DiG 9.12.3 <<>> @ns1.google.com gmail.com MX
    ; (1 server found)
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 22942
    ;; flags: qr aa rd; QUERY: 1, ANSWER: 5, AUTHORITY: 0, ADDITIONAL: 10
    ;; WARNING: recursion requested but not available
    
    ;; QUESTION SECTION:
    ;gmail.com.                     IN      MX
    
    ;; ANSWER SECTION:
    gmail.com.              3600    IN      MX      40 alt4.gmail-smtp-in.l.google.com.
    gmail.com.              3600    IN      MX      20 alt2.gmail-smtp-in.l.google.com.
    gmail.com.              3600    IN      MX      5 gmail-smtp-in.l.google.com.
    gmail.com.              3600    IN      MX      10 alt1.gmail-smtp-in.l.google.com.
    gmail.com.              3600    IN      MX      30 alt3.gmail-smtp-in.l.google.com.
    
    ;; ADDITIONAL SECTION:
    alt4.gmail-smtp-in.l.google.com. 300 IN A       74.125.193.27
    alt4.gmail-smtp-in.l.google.com. 300 IN AAAA    2a00:1450:400b:c01::1b
    alt2.gmail-smtp-in.l.google.com. 300 IN A       172.217.204.27
    alt2.gmail-smtp-in.l.google.com. 300 IN AAAA    2607:f8b0:400c:c15::1a
    gmail-smtp-in.l.google.com. 300 IN      A       173.194.197.27
    gmail-smtp-in.l.google.com. 300 IN      AAAA    2607:f8b0:4001:c1b::1b
    alt1.gmail-smtp-in.l.google.com. 300 IN A       173.194.66.27
    alt1.gmail-smtp-in.l.google.com. 300 IN AAAA    2607:f8b0:400d:c01::1b
    alt3.gmail-smtp-in.l.google.com. 300 IN A       172.217.192.27
    alt3.gmail-smtp-in.l.google.com. 300 IN AAAA    2800:3f0:4003:c02::1a
    
    ;; Query time: 21 msec
    ;; SERVER: 216.239.32.10#53(216.239.32.10)
    ;; WHEN: Tue Dec 11 04:21:55 Central Standard Time 2018
    ;; MSG SIZE  rcvd: 370
    
    
    E:\>
    

    But if I then ask unbound the same.. No additional records given.

    E:\>dig @192.168.9.253 gmail.com MX
    
    ; <<>> DiG 9.12.3 <<>> @192.168.9.253 gmail.com MX
    ; (1 server found)
    ;; global options: +cmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 46825
    ;; flags: qr rd ra; QUERY: 1, ANSWER: 5, AUTHORITY: 0, ADDITIONAL: 1
    
    ;; OPT PSEUDOSECTION:
    ; EDNS: version: 0, flags:; udp: 4096
    ;; QUESTION SECTION:
    ;gmail.com.                     IN      MX
    
    ;; ANSWER SECTION:
    gmail.com.              3069    IN      MX      10 alt1.gmail-smtp-in.l.google.com.
    gmail.com.              3069    IN      MX      40 alt4.gmail-smtp-in.l.google.com.
    gmail.com.              3069    IN      MX      5 gmail-smtp-in.l.google.com.
    gmail.com.              3069    IN      MX      20 alt2.gmail-smtp-in.l.google.com.
    gmail.com.              3069    IN      MX      30 alt3.gmail-smtp-in.l.google.com.
    
    ;; Query time: 0 msec
    ;; SERVER: 192.168.9.253#53(192.168.9.253)
    ;; WHEN: Tue Dec 11 04:23:49 Central Standard Time 2018
    ;; MSG SIZE  rcvd: 161
    
    
    E:\>
    

    BTW, you also left the actual domain you're trying to override in your dig command... Asking outside DNS like 8.8.8.8 or quad9 or 1.1.1.1 for it does not return additional records... ONLY when you ask the authoritative NS do you get back additional records... And again, if I ask Unbound for this, even without any overrides, you do not get back the additional records... Even when it's cached!!!
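
The authoritative-versus-recursive comparison in this post can be read straight off dig's header line. Below is a small sketch with an invented function name. It relies on dig counting the EDNS OPT pseudo-record in the ADDITIONAL total, which is why "ADDITIONAL: 1" in the pfSense replies above means no real additional data.

```python
import re

HEADER_RE = re.compile(
    r";; flags: (?P<flags>[a-z ]+); QUERY: (?P<query>\d+), "
    r"ANSWER: (?P<answer>\d+), AUTHORITY: (?P<authority>\d+), "
    r"ADDITIONAL: (?P<additional>\d+)"
)


def summarize_header(dig_output: str) -> dict:
    """Report whether a dig reply is authoritative and how much additional data it carries."""
    match = HEADER_RE.search(dig_output)
    if match is None:
        raise ValueError("no dig header line found")
    additional = int(match["additional"])
    if "OPT PSEUDOSECTION" in dig_output:
        additional -= 1  # the OPT pseudo-record is included in dig's count
    return {
        "authoritative": "aa" in match["flags"].split(),
        "answers": int(match["answer"]),
        "additional_data": additional,
    }
```

Run against the transcripts in this post, the reply from ns1.google.com comes out as authoritative with 10 additional records, and the replies from 192.168.9.253 as non-authoritative with none.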

    [Attached screenshot: unboundquery.png]

    It would be much easier to talk about your overrides, what gets returned from the SOA of the domain, and what gets returned by Unbound if we could just actually use the domain... But since you have been hiding it, I kept it hidden as well, even though you missed it in your dig ;)

    Edit:
    You're running the 2.4.4 release.. Unbound was UPDATED in 2.4.4p1 -- maybe there was an issue with the previous Unbound not using the default of yes for minimal-responses?

