Certain external DNS servers cannot query our public DNS server
We are having a DNS issue whenever either pfSense 2.0.1 release or 2.1 Beta 0 (10.07.2012) is deployed in front of our network.
Some companies outside of our network are unable to run DNS queries on our DNS server within our network. 126.96.36.199
Example: Google Webmaster Tools, Suddenlink, Elephant Network (colo in Florida)…
We have opened all ports and protocols on our pfSense machine in "floating" (with "Apply the action immediately on match") and on each interface, turned off all auto NAT generation rules as we are a public IP provider, completely disabled all packet filtering (which negates the first two items I just listed), and pulled out all of our hair.
Has anyone run into this weird issue?! Not every DNS server out there has this issue, but enough do that our customers are hurting a lot.
EDIT: I have searched extensively over the past few weeks for help on this issue; so if I missed the magic article or thread post that answers my question, please direct me to it and feel free to close this thread.
Thank you for your time,
No packages installed; no extra services setup.
Configuration includes using a virtual IP for the WAN on our side (pfSense) and HSRP on our colo's side (cisco).
The internal network is connected to two switches via lagg active/passive. No traffic at all passes over the second switch unless the primary is unplugged.
EDIT: we packet sniff all DNS traffic (tcp and udp) on pfsense and find nothing from these servers hitting us, even when we run a "dig @x.x.x.x asldfjdljf" cmd.
Well, if they are not hitting your pfSense, then there is an issue before you on getting to your network. Can they ping you? Have them do a traceroute to the IP you're using for DNS.
Boggles my mind - so you're thinking it is pfSense, when the packets don't even get to pfSense. Really? Come on - you really think that?
I apologize for not including this detail last night:
You would think the issue was simply that the traffic wasn't reaching our network and likely was outside our control, but Time Warner Cable and our colo have investigated as far as they are willing to and show no issues on their setup. The reason no-one wants to investigate deeply has to be because these companies and their networks can ping our DNS server and request websites on our network but cannot DNS query our DNS server.
Riddle me that. :)
How about a strange routing issue?
Maybe an overly long hop count from the rest of the DNS servers, resulting in the packets' TTL expiring before they get to you?
You might be able to tell that from a traceroute from your DNS servers to those who are having a problem and seeing if you are getting blocked also at a certain point. Can your ISP monitor traffic (I bet they can) and see if that traffic is getting to the gateway they have provided you? It sounds like you would have this problem with any firewall or no firewall at all.
So here's the thing: just because they can ping you does not mean that DNS is open between you. You are not seeing the traffic at the WAN interface of your firewall, right?? Then it's not pfSense's issue; it cannot forward or do anything with traffic it cannot see ;)
So when they ping you, you see these pings where you're doing your capture, right!! You are capturing on the WAN (internet) interface of your pfSense router, right? If you don't see the DNS packets, then the issue is between your pfSense and them, not pfSense.
thank you all for your suggestions!
@johnpoz: i am not sure what your last sentence meant; it seemed to conflict with the rest of your reply. thank you for bringing up for everyone that ping (traceroute) and DNS are not the same. we do realize that ICMP and DNS are two different protocols, and that some deployments of DNS servers/forwarders utilize UDP rather than TCP.
@podilarius: TWC has not been much help. we are indeed able to trace to the DNS servers that cannot communicate with us on queries.
@mike: indeed it seems to be a strange routing issue. i have no idea if there is a long hop length between the systems on these requests. i would love to know. if our next couple of plans do not work to resolve the issue and any suggestions here yield no fruit, then we will likely have to chalk it up to an odd routing issue that will decimate our customers and our upcoming projects. we will likely have to move all DNS outside our cage, which i do not want to think about because we need absolute control for our projects. :/
we are requesting our colocation facility remove HSRP so we can switch between our pfSense deployment and the previous router we used, which worked fine but was too old to be comfortable leaving in place.
thank you all for using some of your brain juice on this issue,
"i am not sure what your last sentence meant;"
I don't know how much clearer I could make it..
"EDIT: we packet sniff all DNS traffic (tcp and udp) on pfsense and find nothing from these servers hitting us"
Then WTF does pfSense have to do with it?? If you're not seeing the DNS traffic at the WAN interface of pfSense, then pfSense has NOTHING TO DO WITH YOUR PROBLEM!! NOTHING!!
@john: "then the issue is between your pfsense and them." was my point. was not looking to correct but understand. if you prefer to attack, please move on to another thread.
We have experienced that firewall software can prevent our packet sniffing software on the same server from reporting what is blocked. This is the reason I have posted on the pfSense forums. We have not determined a way to put a machine between the pfSense box and the upstream router without breaking HSRP and thus routing entirely.
It is entirely possible someone has experienced this in a similar deployment. It is entirely possible this is not related to the pfSense distro. Posting this question makes it easy for the next guy searching the web on this problem to find a useful answer. I will post whatever the outcome reveals itself to be.
What johnpoz is assuming is that you have knowledge of how pf (the basis of pfSense) works. This may not be the case. Most don't. So it is understandable where the disconnect is. Filtering happens in the kernel. When you do a tcpdump (packet capture on pfSense) on the WAN interface, you are seeing the traffic before filtering has even taken place. So, if you don't see the traffic there, it is not making it to the interface. This usually indicates a problem upstream. This is why you would ask your ISP to packet sniff on the next hop up from yours. If the traffic is not making it to that interface, then you have to keep going up.
I had this problem also (or the behavior was similar but not quite the same). As it turned out, I mistyped the DNS server address at the registrar. So all those that tried to resolve through the primary (ns1) failed to get the name servers, while those hitting the secondary name server (ns2) were resolving and working correctly.
I would double and triple check everything when it comes to DNS. Once it's right, it works flawlessly, but one typo can set you back a couple of hours. Can those who cannot resolve manually enter your DNS server and resolve? i.e. "dig @<ipaddress> <server.domain.tld>". As a general rule of thumb, you cannot packet sniff anything on the WAN side of a firewall without being on that side of the FW. I would use either the command line tcpdump or its counterpart in the GUI under Diagnostics on the WAN address while someone tries to make the connection.
^ great write up podilarius
"When you do a tcpdump (packet filtering on pfSense) on WAN interface, you are seeing the traffic before filtering has even taken place."
Where would it not be like this?? I have never seen a system where if I am sniffing on the inbound interface it would be after the filtering. That makes no sense.
I like your theory of the mistake in the IP at the registrar – very logical breakdown of how that could cause their symptoms.
Thank you all for continuing to press into this.
My statement, "firewall software can prevent our packet sniffing software on the same server from reporting what is blocked", was information shared with me by the persons troubleshooting it. It may have been a miscommunication, and unfortunately they are not available to query.
Indeed, triple checking every stage and step involved is wise and has been done. The DNS entries are correct and were correct before the change over to new routers (pfSense). Again, no idea what is the actual cause and not placing blame (for those that feel attacked).
My hunch is the issue is outside our cage but I am unable to get additional help from the providers in the path.
EDIT: Their dig commands have been unfruitful, and our dig commands from within our cage to their DNS servers have been unfruitful (no response).
ex: dig @188.8.131.52 samware.net
How about sticking a switch between your pfsense box that performs port mirroring?
That way you could use Wireshark to sniff the external interface unhindered.
@mike: I would love to do that but we have HSRP setup. We are scheduling with our colo to remove HSRP to further test the problem.
from within our cage to their DNS servers have been unfruitful (no response).
ex: dig @184.108.40.206 samware.net
So you cannot get to them either on 53?? But you can ping them? Was that just an example IP and query? Or was that the actual IP and domain? I don't show samware.net on that IP.
hint: when doing examples of something like that, it's better to be clear it's an example: www.example.tld, foo.bar, ip 220.127.116.11 or <theirip>, etc.
So to be clear: can they ping the public IP(s) that your nameservers are on, and those packets show up on the sniff, but the DNS query does not show up in the sniff? And can you ping their DNS IP or not? But can you do a DNS query?
That was an example dig command for one of the DNS servers that cannot reach us for DNS queries. That format of the dig command on our Linux and BSD boxes tells dig to ask the "@x.x.x.x" DNS server to make the query. It's as if I set my machine's network settings to use 18.104.22.168 as its DNS server.
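For anyone following along without dig handy, here is a rough Python sketch of what "dig @server name" does on the wire: it builds a small DNS query and sends it straight to the named server over UDP port 53. The function names build_query and ask_server are made up for this illustration, and the sketch only covers a plain A-record query over UDP (no TCP fallback, no EDNS).

```python
import socket
import struct


def build_query(name: str, query_id: int = 0x1234, recursion_desired: bool = True) -> bytes:
    """Build a minimal DNS query for an A record (QTYPE=1, QCLASS=IN)."""
    flags = 0x0100 if recursion_desired else 0x0000  # only the RD bit is set
    # Header: ID, flags, QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", query_id, flags, 1, 0, 0, 0)
    # QNAME: each label is prefixed by its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


def ask_server(server_ip: str, name: str, timeout: float = 3.0) -> bytes:
    """Send the query directly to server_ip on UDP/53 and return the raw reply.

    Raises socket.timeout if no reply arrives, which is roughly the
    "connection timed out; no servers could be reached" case from dig.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server_ip, 53))
        reply, _ = sock.recvfrom(4096)
        return reply
```

Running ask_server against a resolver while a capture is running on the WAN interface shows whether the query and the reply each cross that interface.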
When you modify that command to query for a different domain than one we are authority over (samware.net), that DNS server can get a result. The proper result for "samware.net" is 22.214.171.124 but usually the reply is ";; connection timed out; no servers could be reached" and some rare times they will reply with an advertisement IP.
All of these DNS servers that cannot reach us can ping us and we can ping them. Ping packets show up on the sniff. Their incoming DNS queries do not show up on the sniffer.
"and some rare times they will reply with an advertisement IP."
Really?? That makes no sense - can we get a result from them doing dig +trace?
So for example you mention one of your zones is samware.net
So I see the DNS for that as ns1 and ns2.samware.net doing a whois
Domain name: samware.net
I show them as
ns1.samware.net ['126.96.36.199'] [TTL=10800]
ns2.samware.net ['188.8.131.52'] [TTL=10800]
So if I do a trace for say www.samware.net
; <<>> DiG 9.8.1-P1 <<>> www.samware.net +trace
;; global options: +cmd
. 168690 IN NS a.root-servers.net.
. 168690 IN NS j.root-servers.net.
. 168690 IN NS e.root-servers.net.
. 168690 IN NS l.root-servers.net.
. 168690 IN NS f.root-servers.net.
. 168690 IN NS i.root-servers.net.
. 168690 IN NS d.root-servers.net.
. 168690 IN NS k.root-servers.net.
. 168690 IN NS g.root-servers.net.
. 168690 IN NS m.root-servers.net.
. 168690 IN NS h.root-servers.net.
. 168690 IN NS b.root-servers.net.
. 168690 IN NS c.root-servers.net.
;; Received 508 bytes from 192.168.1.253#53(192.168.1.253) in 481 ms

net. 172800 IN NS e.gtld-servers.net.
net. 172800 IN NS c.gtld-servers.net.
net. 172800 IN NS f.gtld-servers.net.
net. 172800 IN NS l.gtld-servers.net.
net. 172800 IN NS b.gtld-servers.net.
net. 172800 IN NS g.gtld-servers.net.
net. 172800 IN NS h.gtld-servers.net.
net. 172800 IN NS a.gtld-servers.net.
net. 172800 IN NS j.gtld-servers.net.
net. 172800 IN NS m.gtld-servers.net.
net. 172800 IN NS i.gtld-servers.net.
net. 172800 IN NS k.gtld-servers.net.
net. 172800 IN NS d.gtld-servers.net.
;; Received 490 bytes from 184.108.40.206#53(220.127.116.11) in 733 ms

samware.net. 172800 IN NS ns1.samware.net.
samware.net. 172800 IN NS ns2.samware.net.
;; Received 101 bytes from 18.104.22.168#53(22.214.171.124) in 249 ms

www.samware.net. 10800 IN A 126.96.36.199
;; Received 49 bytes from 188.8.131.52#53(184.108.40.206) in 47 ms
So this time 141.5 answered - if I do it again this time the 140.5 answered
<snipped>
;; Received 101 bytes from 220.127.116.11#53(18.104.22.168) in 132 ms

www.samware.net. 10800 IN A 22.214.171.124
;; Received 49 bytes from 126.96.36.199#53(188.8.131.52) in 49 ms
I would love to see a query +trace that comes back with an advertisement IP??
BTW, while looking at the nameservers for samware.net: they allow recursion. It is normally very bad juju for an authoritative NS to allow recursion from the public.
; <<>> DiG 9.8.1-P1 <<>> @184.108.40.206 www.google.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 32492
;; flags: qr rd ra; QUERY: 1, ANSWER: 5, AUTHORITY: 0, ADDITIONAL: 0

;; QUESTION SECTION:
;www.google.com. IN A

;; ANSWER SECTION:
www.google.com. 300 IN A 220.127.116.11
www.google.com. 300 IN A 18.104.22.168
www.google.com. 300 IN A 22.214.171.124
www.google.com. 300 IN A 126.96.36.199
www.google.com. 300 IN A 188.8.131.52

;; Query time: 99 msec
;; SERVER: 184.108.40.206#53(220.127.116.11)
;; WHEN: Thu Oct 11 15:16:07 2012
;; MSG SIZE rcvd: 112
You almost never allow recursion from public IPs - unless you're running a public DNS server. You leave yourself open to DoS attacks by allowing that. And now your boxes that are authoritative for your zones will not be able to answer because they are too busy doing recursion for people that shouldn't be using your box for it.
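The "ra" in a dig flags line is the Recursion Available bit in the DNS reply header. As a small sketch (the helper names dns_header_flags and offers_recursion are made up here, and the bit positions are the ones defined in RFC 1035), this is roughly how a script could read that bit from a raw reply to check whether a supposedly authoritative-only server is advertising recursion:

```python
import struct


def dns_header_flags(message: bytes) -> dict:
    """Decode the flag bits from the 12-byte header of a raw DNS message."""
    if len(message) < 12:
        raise ValueError("a DNS message needs at least a 12-byte header")
    _, flags = struct.unpack("!HH", message[:4])  # ID, then the 16 flag bits
    return {
        "qr": bool(flags & 0x8000),  # 1 = this is a response
        "aa": bool(flags & 0x0400),  # authoritative answer
        "tc": bool(flags & 0x0200),  # truncated
        "rd": bool(flags & 0x0100),  # recursion desired (copied from query)
        "ra": bool(flags & 0x0080),  # recursion available
        "rcode": flags & 0x000F,     # 0 = NOERROR
    }


def offers_recursion(message: bytes) -> bool:
    """True if the message is a response that advertises recursion."""
    flags = dns_header_flags(message)
    return flags["qr"] and flags["ra"]
```

A reply carrying "qr rd ra" has flag bits 0x8180, so offers_recursion returns True for it; a reply from a server with recursion turned off would normally leave the ra bit clear.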
So if they are getting answers ("advertisement IPs") back when they are doing specific @queries to your IPs, and you're not seeing the packets, that tells me something else is answering. What is answering is the question. When you say you're behind HSRP, could this traffic destined for your IP be getting sent to some other NS? How else would they be getting a response?
Yes, unless you are really meaning to, running a recursive public DNS server is bad for business. I did that for a while, but learned before it burned me.
Is samware.com on your DNS server? Were you able to trace on the WAN interface as opposed to doing it on the LAN? If the traffic is not getting to your WAN, then you have a problem upstream. If you are getting it, then we can check within pfSense to find out what is going on. You are going to have to look a bit more, I am afraid. What you really want to do is check WAN sniffing for the packets. You don't need to wait on HSRP to be removed, unless you think that is part of the problem. Either way, please let us know what you find.
Thank you all for your wonderful ideas and for pointing out the public availability of our DNS servers. At one point, we were fine with recursion for various reasons but over the past year our servers have been hammered!
Anyways, the problem was due to converting our DNS from FreeBSD to CentOS, adding IP aliases to the NIC, and not having the proper subnet assigned to those aliases. It was working fine on the old router system but since our colo made some routing changes and we implemented pfSense, the faulty subnet settings popped up.
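The post does not say which mask was wrong, so the following is only an illustration of one way a bad prefix on an IP alias can break things for some remote hosts and not others. The addresses are documentation ranges, not the real ones, and is_on_link is a made-up helper. If an alias is given too wide a prefix, the host treats some remote addresses as directly attached and never sends those packets to the gateway.

```python
import ipaddress


def is_on_link(local_ip: str, prefix_len: int, remote_ip: str) -> bool:
    """True if a host configured as local_ip/prefix_len would treat remote_ip
    as directly reachable on its own segment (no gateway involved)."""
    network = ipaddress.ip_interface(f"{local_ip}/{prefix_len}").network
    return ipaddress.ip_address(remote_ip) in network


# Hypothetical alias on the DNS server, and two hypothetical remote resolvers.
alias = "198.51.100.10"
resolver_a = "198.51.7.7"    # shares the first two octets with the alias
resolver_b = "203.0.113.5"   # unrelated network

# With the intended /24, both resolvers are off-link and go via the gateway.
print(is_on_link(alias, 24, resolver_a), is_on_link(alias, 24, resolver_b))

# With a mistaken /16, resolver_a looks local, so traffic to it never
# reaches the router, while resolver_b keeps working.
print(is_on_link(alias, 16, resolver_a), is_on_link(alias, 16, resolver_b))
```

That kind of partial breakage (some remote networks fine, others not) is why a per-alias check of address and prefix is worth doing after moving a service between operating systems.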
Again, thank you all!