Need help understanding why DNS was not resolving a certain hostname until I pinged it from pfSense
I run pfSense on a Netgate 2100 MAX in a colocation facility. Last Thursday none of my servers behind the device could ping (or reach at all) paypal.com.
Every server on the network was returning:
ping: paypal.com: Temporary failure in name resolution
It was bad because we have to hit PayPal on the backend, so our backend services were down. I logged in to pfSense on the 2100 and used Diagnostics > Ping to ping paypal.com, and it worked perfectly.
Then the backend services magically restored themselves.
Please help me understand what the failure was, and why pinging from the firewall resolved it.
How do you have DNS set up? Are you just resolving? Are you forwarding? Are you using DoT?
@johnpoz Good question, forgot to mention that.
I have all servers pointing their DNS at the pfSense box.
@scubanarc But what is pfSense doing? Out of the box it would just resolve. And there is no difference between pfSense itself asking itself (unbound) for whatever.com and a client asking unbound for whatever.com.
@johnpoz It's in the out of the box config.
Services / DNS Resolver / General Settings: Enable DNS resolver is checked.
I have not changed any other settings.
More info... it was "broken" for 7 hours. The very second I tried the ping in the pfSense menu it started working again on all of the servers.
Does unbound cache DNS records at all?
Yes, it caches the name servers it has talked to, and it caches the records it gets for the TTL that the authoritative NS puts on each record, etc.
@johnpoz Is it possible that it cached some sort of bogus result and passed that on for 7 hours, and that cache was somehow cleared or refreshed when the web GUI asked for it?
@scubanarc Anything is possible with the amount of info we have.
Temporary failure in name resolution
That is a Linux sort of error, so the box you were on was Linux, right? Are you sure your Linux box is actually pointed directly at pfSense? They quite often point to their own local cache at 127.0.0.53.
If you happen to have the issue again, do an actual DNS query from the client with dig, host, or even nslookup, so we can get some understanding of what is happening: is unbound not answering, or is it sending back REFUSED, NXDOMAIN, etc.?
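To make the "not answering vs. refused vs. nx" distinction concrete, here is a minimal sketch that labels saved dig output by the status in its header line. The function name and the sample header (including its id) are made up for illustration; the only assumptions about dig are that its header contains "status: ..." and that its timeout message contains the words "timed out".

```python
import re

# dig prints a header line such as:
# ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 12345
STATUS_RE = re.compile(r"status:\s*([A-Z]+)")


def classify_dig_output(text):
    """Return a short label for what a dig run says about the server."""
    match = STATUS_RE.search(text)
    if match:
        # NOERROR, NXDOMAIN, SERVFAIL, REFUSED, ...
        return match.group(1)
    if "timed out" in text:
        # No reply at all came back from the server.
        return "TIMEOUT"
    return "UNKNOWN"


# Hypothetical header line, not captured from the outage in this thread.
sample = ";; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 4242"
print(classify_dig_output(sample))  # prints SERVFAIL
```

A SERVFAIL or TIMEOUT would point at the resolver or its upstream path, while REFUSED would suggest an access-control problem on the resolver.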
If you were actually looking for the FQDN paypal.com, the TTL on those records is 300 seconds (5 minutes):
;; QUESTION SECTION:
;paypal.com. IN A

;; ANSWER SECTION:
paypal.com. 300 IN A 220.127.116.11
paypal.com. 300 IN A 18.104.22.168
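As a rough illustration of why a 300 second TTL matters here, the sketch below is a toy cache that drops an entry once its TTL has elapsed. It is not how unbound is implemented, and it ignores resolver options that can stretch or shorten cache lifetimes; the class name and the fake clock are invented for the example.

```python
import time


class TtlCache:
    """Tiny cache that forgets an entry once its TTL has elapsed."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # name -> (value, expiry time)

    def put(self, name, value, ttl):
        self._entries[name] = (value, self._clock() + ttl)

    def get(self, name):
        entry = self._entries.get(name)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Expired: a real resolver would have to ask upstream again.
            del self._entries[name]
            return None
        return value


# Fake clock so the example runs instantly.
now = [0.0]
cache = TtlCache(clock=lambda: now[0])
cache.put("paypal.com.", "cached-answer", ttl=300)

now[0] = 299
print(cache.get("paypal.com."))  # prints cached-answer

now[0] = 7 * 3600
print(cache.get("paypal.com."))  # prints None

print(7 * 3600 // 300)  # prints 84
```

Under that assumption, a single bad answer with a 300 second TTL would have expired 84 times over during a 7 hour outage, so one stale record on its own does not obviously explain it.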
Were you looking for www.paypal.com? That goes to a whole different set of NS via a CNAME.
;; QUESTION SECTION:
;www.paypal.com. IN A

;; ANSWER SECTION:
www.paypal.com. 3600 IN CNAME www.glb.paypal.com.

;; AUTHORITY SECTION:
glb.paypal.com. 300 IN NS ns02.glb.paypalinc.com.
glb.paypal.com. 300 IN NS dns2.p10.nsone.net.
glb.paypal.com. 300 IN NS dns4.p10.nsone.net.
glb.paypal.com. 300 IN NS dns1.p10.nsone.net.
glb.paypal.com. 300 IN NS dns3.p10.nsone.net.
glb.paypal.com. 300 IN NS ns01.glb.paypalinc.com.
Again with a short TTL, and again it points to yet another CNAME:
;; QUESTION SECTION:
;www.glb.paypal.com. IN A

;; ANSWER SECTION:
www.glb.paypal.com. 300 IN CNAME www.paypal.com-a.edgekey.net.
And so on. With the lack of actual info, and only the coincidental observation that it all started to work when you queried from pfSense, we really have no idea what was going on.
You say your servers talk directly to PayPal, through some API-type FQDN?
@johnpoz Correct, Linux. Ubuntu 20.04 to be exact.
The exact FQDN that the servers are looking for is:
Which is the IPN callback (IPN stands for Instant Payment Notification).
However, while it was down I was also unable to ping paypal.com (without the www), which I found very weird.
I just ran dig on paypal.com and I can see that you are correct, the reply came from 127.0.0.53#53, so it's possible that my local machine was caching the incorrect reply.
But I want to stress that this was simultaneously broken on multiple servers.
If it happens again I'll dig the address and report back here. Hopefully it does not.
Thank you for the help, I at least have some new troubleshooting tools at my disposal.
@scubanarc Also check what exactly your boxes are pointing at. If it is something local, it can be pointing to something other than what you think it's pointing to.
It's been a while since I dug into that, and on all my Linux boxes I disable pulling DNS info from systemd-resolved.
Here is some info that should get you started down that rabbit hole ;)
But from experience, sometimes the box is not actually pointing to where you think it's pointing for DNS.
So for that specific FQDN, again, you start down the CNAME rabbit hole with really, really short TTLs:
;; QUESTION SECTION:
;ipnpb.paypal.com. IN A

;; ANSWER SECTION:
ipnpb.paypal.com. 300 IN CNAME ipnpb.glb.paypal.com.

;; AUTHORITY SECTION:
glb.paypal.com. 300 IN NS ns01.glb.paypalinc.com.
glb.paypal.com. 300 IN NS dns2.p10.nsone.net.
glb.paypal.com. 300 IN NS dns3.p10.nsone.net.
glb.paypal.com. 300 IN NS dns1.p10.nsone.net.
glb.paypal.com. 300 IN NS dns4.p10.nsone.net.
glb.paypal.com. 300 IN NS ns02.glb.paypalinc.com.
@johnpoz I was not aware of the resolv.conf issues that you linked to, that is very interesting.
Indeed, my /etc/resolv.conf points to stub-resolv.conf.
You say that you point it at /run/systemd/resolve/resolv.conf on your servers? Has this ever caused you any issues?
Looking at my /run/systemd/resolve/resolv.conf, I can see that it does have my 2100 as the nameserver.
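For anyone wanting to check several servers the same way, here is a small sketch that lists the nameserver entries in resolv.conf-style text and flags a loopback address such as the 127.0.0.53 stub. The helper names and the two sample strings are made up for illustration; the samples only mimic what the stub file and the upstream file might contain.

```python
def nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        line = line.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith(("#", ";")):
            continue
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def uses_local_stub(resolv_conf_text):
    """True if any listed nameserver is a loopback address (a local cache)."""
    return any(addr.startswith("127.") for addr in nameservers(resolv_conf_text))


# Illustrative contents only, not copied from a real machine.
stub_example = "# managed by systemd-resolved\nnameserver 127.0.0.53\noptions edns0\n"
direct_example = "nameserver 192.168.20.1\n"

print(nameservers(stub_example), uses_local_stub(stub_example))
print(nameservers(direct_example), uses_local_stub(direct_example))
```

On a real box the text could come from reading /etc/resolv.conf and /run/systemd/resolve/resolv.conf; comparing the two shows whether applications talk to the local stub or straight to the firewall.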
@scubanarc No, I normally disable that whole system for DNS. Not a fan of it ;)
Don't take that the wrong way: the systemd stuff works. It's just that if you're not aware of it, you might not actually be pointing to where you "think" you are pointing for DNS ;) So it can cause confusion when troubleshooting DNS.
If you're aware of it, and know how to configure and work with it, then it's fine to use.
@johnpoz Got it. Assuming I leave my settings stock (pointing at stub-resolv.conf), then the next time this happens I can query pfSense directly by doing this:
dig @192.168.20.1 ipnpb.paypal.com
(my 2100 is 192.168.20.1)
and that will tell me what pfSense is thinking during the problem. Is there any other troubleshooting I can do during the outage?
@scubanarc Yup, that would be a directed query specific to pfSense. You should hope to glean something from that: timeout, NXDOMAIN, REFUSED, something ;)
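Alongside the directed dig, it may help to record what the normal application path (getaddrinfo, which goes through whatever /etc/resolv.conf points at) reports at the same moment. The sketch below is one way to label that outcome; on glibc the message "Temporary failure in name resolution" corresponds to the EAI_AGAIN error code. The function names are made up, and the failing resolver is a stand-in so the example runs without touching the network.

```python
import socket


def check_name(host, resolver=socket.getaddrinfo):
    """Resolve host through the system resolver path and label the outcome."""
    try:
        results = resolver(host, None)
    except socket.gaierror as err:
        if err.errno == socket.EAI_AGAIN:
            return "temporary failure (EAI_AGAIN)"
        if err.errno == socket.EAI_NONAME:
            return "name not known (EAI_NONAME)"
        return "other resolver error: %s" % err
    # Each result is (family, type, proto, canonname, sockaddr).
    addresses = sorted({item[4][0] for item in results})
    return "ok: " + ", ".join(addresses)


def fake_temporary_failure(host, port):
    """Stand-in resolver that fails the way the servers in this thread did."""
    raise socket.gaierror(socket.EAI_AGAIN, "Temporary failure in name resolution")


print(check_name("ipnpb.paypal.com", resolver=fake_temporary_failure))
# prints: temporary failure (EAI_AGAIN)
```

Calling check_name("ipnpb.paypal.com") without the resolver argument uses the real system resolver. If that reports a temporary failure while the directed dig against 192.168.20.1 answers normally, the local stub would be the more likely suspect than pfSense.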