Strange DNS issue for internal clients...

ericwentz

Hi All, I'm running pfsense+ 24.11-RELEASE on a Protectli vault. In the system DNS settings I have two servers set up - one from Cloudflare and the other from Level3. My DHCP is set up with "Enable DNS registration" and "Enable early DNS registration" checked. I'm using the DNS Resolver service.
First off, the DNS resolution to external sites is rock solid. The issue comes with my internal devices. I have the majority of them set up with pseudo fixed IPs, i.e. I'm using the mac address to bind to the IP address.
So much for the background, here's the issue: the name resolution of my internal devices is intermittent. Sometimes I get a valid ip (using nslookup from my Mac Mini) and other times, I get "Can't find server.domain.com: No answer." Then another test will return the IP address again. This behavior is the same when I use the pfsense "Diagnostics / DNS Lookup" tool. I've also had the same results from my Macbook Pro.
I'd appreciate it if anyone could shed some light on this issue. Please let me know if I can supply any additional information about my setup.
Thanks
Eric

Gertjan

@ericwentz

Like this :

So can I presume you use kea and not ISC DHCP ?

Check your /etc/hosts file.
It looks correct ?

@ericwentz said in Strange DNS issue for internal clients...:

Sometimes I get a valid ip (using nslookup from my Mac Mini) and other times, I get "Can't find server.domain.com: No answer." Then another test will return the IP address again. This behavior is the same when I use the pfsense "Diagnostics / DNS Lookup" tool.

Humm, if even the "Diagnostics / DNS Lookup" gives no answer, this means that it couldn't contact unbound, the Resolver.
Otherwise, that process is always in the running state, so 127.0.0.1:53 will answer any question.

Check your Status > System Logs > System > DNS Resolver log file and locate the word "start" : does it start (thus restart) often ? This can happen every time a pfSense network interface goes down and then up again.
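To answer the "does it restart often ?" question without scrolling through the GUI log, you could count the start lines per minute. This is only a sketch: it assumes unbound announces each (re)start with a line containing "start of service" and that the log uses classic syslog timestamps. The function name and the sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumption: each unbound (re)start logs a line containing this marker.
# Adjust it if the wording in your log differs.
START_MARKER = "start of service"

def count_starts_per_minute(lines):
    """Return a Counter mapping 'Mon DD HH:MM' to the number of start lines."""
    starts = Counter()
    for line in lines:
        if START_MARKER not in line:
            continue
        # Classic syslog timestamp, e.g. "May 15 14:14:05"; keep it up to the minute.
        match = re.match(r"^(\w{3}\s+\d+\s+\d{2}:\d{2})", line)
        if match:
            starts[match.group(1)] += 1
    return starts

if __name__ == "__main__":
    # Hypothetical sample lines, not copied from a real system.
    sample = [
        "May 15 14:14:05 fw unbound[111]: [111:0] info: start of service (unbound)",
        "May 15 14:14:40 fw unbound[222]: [222:0] info: start of service (unbound)",
        "May 15 14:15:01 fw unbound[222]: [222:0] info: service stopped",
    ]
    for minute, count in count_starts_per_minute(sample).items():
        print(minute, count)
```

On the firewall you would pass the opened resolver log file instead of the sample list. More than one start within the same minute is a strong hint that something keeps restarting unbound.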

Example :

When I set up a DHCP static lease on my LAN DHCP server like this :

then from now on, its info will also exist in the /etc/hosts file :

and this file is 'integrated' into unbound, the Resolver, so :

Extra info : when doing this :

@ericwentz said in Strange DNS issue for internal clients...:

I get a valid ip (using nslookup from my Mac Mini) and other time

be sure that the request is sent to the pfSense LAN IP, and not to 8.8.8.8 or some other DNS server.
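One way to take the client's own resolver list out of the picture is to build the DNS query yourself and send it to exactly one server. The sketch below does that with nothing but the Python standard library. The function names are mine, and 192.168.1.1 and server.domain.com are just the placeholder values used in this thread.

```python
import socket
import struct

def build_query(name, query_id=0x1234):
    """Build a minimal DNS query packet asking for the A record of `name`."""
    # Header: id, flags (0x0100 = recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: every label is prefixed by its length, a zero byte ends the name.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE 1 = A, QCLASS 1 = IN.
    return header + qname + struct.pack("!HH", 1, 1)

def ask(server, name, timeout=2.0):
    """Send the query to one specific server over UDP port 53.

    Returns (rcode, answer_count) taken from the reply header.
    Raises socket.timeout if the server does not reply in time.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, 53))
        reply, _ = sock.recvfrom(512)
    _id, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", reply[:12])
    return flags & 0x000F, ancount
```

Calling `ask("192.168.1.1", "server.domain.com")` a few times in a row shows whether that one server answers consistently. A timeout means nothing replied on port 53 at all, which is a different failure from a reply that carries zero answers.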

Just to be sure :
Packet capture on your LAN port, set it up with "all the details", TCP and UDP, port 53, and IP address = 192.168.1.1 or whatever LAN pfSense IP you have.
Now capture.
Do a nslookup on your Mac Mini and you should see the packet with the DNS request.

You can also go wild with the Resolver log details : Pick any : the higher the better :

and you'll see your DNS request coming in, and handled.

Warning : don't forget to set this back to a normal level like "Level 1" as high levels will produce huge quantities of log lines.

johnpoz

@ericwentz said in Strange DNS issue for internal clients...:

Please let me know if I can supply any additional information about my setup.

Your clients only point to pfsense IP on your network for dns? When clients point to more than 1 name server you're never really sure which one they might ask.

So for example if your client has say

8.8.8.8
192.168.1.1 (pfsense IP)

And you ask pfsense IP for say server.home.arpa and it knows about this you will get an answer. But if you ask 8.8.8.8 it is not going to have a clue about anything in a home.arpa domain.

If you're going to point your clients to more than 1 IP for dns - you need to be sure that your different IPs can resolve the same stuff.
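A rough sketch of that check: ask every configured name server for the same name and see whether they all agree. The actual lookup is passed in as a function so the sketch stays self-contained; `fake_lookup` and the address 192.168.1.50 are hypothetical stand-ins that mimic the 8.8.8.8 / pfsense example above, not real data.

```python
def compare_servers(servers, name, lookup):
    """Ask every server for `name`; return {server: answer, or None on failure}."""
    results = {}
    for server in servers:
        try:
            results[server] = lookup(server, name)
        except Exception:
            # Timeouts, refusals, NXDOMAIN and so on all count as "no answer" here.
            results[server] = None
    return results

def all_agree(results):
    """True only if every server gave an answer and all answers are identical."""
    answers = set(results.values())
    return len(answers) == 1 and None not in answers

def fake_lookup(server, name):
    """Hypothetical stand-in: only the pfsense IP knows the local domain."""
    local_records = {"server.home.arpa": "192.168.1.50"}
    if server == "192.168.1.1" and name in local_records:
        return local_records[name]
    raise LookupError(f"{server} knows nothing about {name}")

if __name__ == "__main__":
    results = compare_servers(["8.8.8.8", "192.168.1.1"], "server.home.arpa", fake_lookup)
    print(results, "agree:", all_agree(results))
```

If `all_agree` comes back false for a local name, the client will get an answer or not depending on which server it happens to ask.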

If you're pointing to only pfsense IP on your clients, and you sometimes get an answer for server.home.arpa and sometimes not, it could be unbound is restarting.. If unbound is in the middle of restarting, it can't answer anything.

ericwentz

Thanks so much for the prompt replies - much appreciated. I got into the console and was looking at /var/log/dhcpd.log (I used tail -f ). Looks like something is bouncing up and down every few seconds. Here's a snippet out of the log file:
May 15 14:14:05 fw kea2unbound[26679]: Remove record: "zentrios-ace.{redacted}. 600 IN A XX.X.10.72"
May 15 14:14:06 fw kea2unbound[26679]: Write include: /var/unbound/leases/leases4.conf (719c5c75ef3cb1c10f9e35886f48ba3fb90b09cb9d4105f510567584cd54c475)
May 15 14:14:17 fw kea-dhcp4[26043]: WARN [kea-dhcp4.dhcp4.0x636b2812000] DHCP4_MULTI_THREADING_INFO enabled: yes, number of threads: 4, queue size: 64
May 15 14:14:17 fw kea-dhcp4[26043]: ERROR [kea-dhcp4.commands.0x636b2812000] COMMAND_SOCKET_WRITE_FAIL Error while writing to command socket 26 : Broken pipe
May 15 14:14:17 fw kea-dhcp4[26043]: ERROR [kea-dhcp4.commands.0x636b2812000] COMMAND_SOCKET_WRITE_FAIL Error while writing to command socket 29 : Broken pipe
May 15 14:14:17 fw kea-dhcp4[26043]: ERROR [kea-dhcp4.commands.0x636b2812000] COMMAND_SOCKET_WRITE_FAIL Error while writing to command socket 32 : Broken pipe
May 15 14:14:17 fw kea-dhcp4[26043]: ERROR [kea-dhcp4.commands.0x636b2812000] COMMAND_SOCKET_WRITE_FAIL Error while writing to command socket 28 : Broken pipe
May 15 14:14:17 fw kea-dhcp4[26043]: ERROR [kea-dhcp4.commands.0x636b2812000] COMMAND_SOCKET_WRITE_FAIL Error while writing to command socket 31 : Broken pipe
May 15 14:14:31 fw kea2unbound[89935]: Add record: "winserver.{redacted}. 600 IN A XX.X.1.5"

I had a bit more - but my post was getting marked as spam. But it seems like I'm cycling through a remove/add cycle every few seconds with those warnings and errors included.
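To put a number on that remove/add cycle, the kea2unbound lines can be tallied per host. This is a minimal sketch that assumes the log lines look exactly like the ones pasted above; the regular expression and function name are mine and would need adjusting for any other wording.

```python
import re
from collections import Counter

# Matches lines such as:
#   ... kea2unbound[26679]: Remove record: "host.example. 600 IN A 10.0.0.1"
RECORD_RE = re.compile(
    r'kea2unbound\[\d+\]: (Add|Remove) record: "(\S+)\s+(\d+)\s+IN\s+A\s+([^"\s]+)"'
)

def summarise(lines):
    """Return (events, ttls): a Counter keyed by (action, name) and the set of TTLs seen."""
    events = Counter()
    ttls = set()
    for line in lines:
        match = RECORD_RE.search(line)
        if match:
            action, name, ttl, _address = match.groups()
            events[(action, name)] += 1
            ttls.add(int(ttl))
    return events, ttls

if __name__ == "__main__":
    # Lines copied from the snippet above.
    log = [
        'May 15 14:14:05 fw kea2unbound[26679]: Remove record: "zentrios-ace.{redacted}. 600 IN A XX.X.10.72"',
        'May 15 14:14:31 fw kea2unbound[89935]: Add record: "winserver.{redacted}. 600 IN A XX.X.1.5"',
    ]
    events, ttls = summarise(log)
    for (action, name), count in sorted(events.items()):
        print(f"{action:6} {name} x{count}")
    print("TTLs seen:", sorted(ttls))
```

A host that shows up with many Add and Remove events in a short log window is being re-registered over and over, which matches the cycling described here.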

ericwentz

Okay I think I may have solved it (time will tell). I had set up the service watchdog to keep my DHCP server running - unfortunately, it was connected to the OLD DHCP service and kept trying to start it. I'll keep an eye on it for a few days and add a follow-up post if this was the actual fix for the issue.

Hope that maybe this may help someone else as well - this was a pain!!
Eric

johnpoz

@ericwentz dhcp has nothing to do with dns.. Other than it writing records to unbound.. But not being able to write something, shouldn't prevent unbound from answering for something that it already knows about.

Unless the failure to write is because unbound is down.

I personally don't think kea is ready for primetime, at least for my use case.. But these bother me..

Remove record: "zentrios-ace.{redacted}. 600 IN A XX.X.10.72"
Add record: "winserver.{redacted}. 600 IN A XX.X.1.5"

Why would it be using a 5 minute ttl (600 seconds) for a record it was adding to unbound? Because of a dhcp lease? That is really low ttl, I haven't looked into kea really since first preview..

Why would it be removing a record, did the lease expire, was it released by the client?

I got around the issue of isc restarting unbound all the time by not registering dhcp, and only dhcp reservations.. Once a client is going to be on my network I set a reservation so it always has the same IP... If something is just going to be on my network temp, like a guest user to my wifi, or some box working on for someone, etc. I have zero need to resolve it via a fqdn.

I would hope when kea is ready for primetime they would allow you to adjust what the TTL is of records it's going to add via dhcp leases. Most of my leases are like 8 days long - there would be no reason to have a ttl of 5 minutes.. All of mine are min of 1 hour.

Gertjan

@johnpoz said in Strange DNS issue for internal clients...:

Why would it be using a 5 minute ttl (600 seconds) ...

Here is some fresh info about that short TTL.

johnpoz

@Gertjan thanks..

"Unbound cache with the TTL being one-third of the lease duration. "

So you're saying he has a 15 minute lease time set?? That seems really low ;)

Gertjan

@johnpoz said in Strange DNS issue for internal clients...:

So you're saying he has a 15 minute ...

Not me

I looked for TTL/ttl in /usr/local/bin/kea2unbound - I found where the local-data xxxxxxx lines are created, and these lines, imho, can't lose their TTL value as they are known / declared locally, like the Resolver's "Host Overrides" : once declared they stay valid for life.

Truth is, DHCP leases are always time limited ..... and bingo, found it - I searched for "3" and found it straight away

It's RFC defined behaviour.

johnpoz

@Gertjan

RFC clearly states

"but SHOULD NOT be less than 10 minutes. "

Yet seems it's putting in 5 minutes - by my math, 5 is less than 10 ;)

Like I said - not ready for primetime imho. So 2 hour default lease time, 1/3 of that then the ttl put in should be like 40 minutes.
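The arithmetic can be written out as a small sketch. It assumes the rule is exactly what is quoted in this thread: a TTL of one third of the lease time, with a floor of 10 minutes. Whether kea2unbound computes it this way is not confirmed here, and the function name is made up.

```python
MIN_TTL_SECONDS = 10 * 60  # the "SHOULD NOT be less than 10 minutes" floor

def suggested_ttl(lease_seconds):
    """One third of the lease time, but never below the 10 minute floor."""
    return max(lease_seconds // 3, MIN_TTL_SECONDS)

if __name__ == "__main__":
    # 15 minute lease, 2 hour default lease, 8 day lease.
    for lease in (15 * 60, 7200, 8 * 24 * 3600):
        ttl = suggested_ttl(lease)
        print(f"lease {lease:>6}s -> ttl {ttl:>6}s ({ttl / 60:g} min)")
```

With the 7200 second default lease this gives 2400 seconds, the 40 minutes mentioned above. For reference, 600 seconds is 10 minutes, so the 600 in the logged records equals the floor in this sketch.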

ericwentz

Okay, closing the issue - removing the Service Watchdog for the old DHCP service and adding an entry for the new kea-dhcp4 service has solved the issue. I've also taken the feedback regarding my TTL times and set them all to 7200. Problem solved. Thanks to all for your generous contributions to helping me work this problem. This forum is great.

Best to all - Eric

johnpoz

@ericwentz where did you set that in kea? If kea is registering the entry? Did you set min ttl in unbound or something?

And watchdog sure shouldn't be needed.. It has had serious issues in the past.

ericwentz

I've set the DHCP TTL under "Services->DHCP Server", then in the settings for each individual network under "Other DHCP Options" - there's an entry called "Default Lease Time" which looks like it defaults to 7200 seconds, but I explicitly put this value in - just to be sure.

Finally, I've removed the "Service Watchdog" service - I really have not had any issues with services failing, so this is probably unnecessary. I had originally configured it early in my pfsense journey and never looked at it again. Figure no sense in putting any extra load on the FW. -e

johnpoz

@ericwentz and the dhcp lease time has zero to do with a dns ttl on a record.. The default is 7200 seconds, or 2 hours.

Which per the rfc Gertjan pointed out the registration of that in dns should be like 1/3 of the lease and not shorter than 10 minutes..

My issue is what you showed in the log of kea was it was writing a record with a ttl of 5 minutes - which to be honest on a local network is insanely low.. Makes zero sense to me and clearly not following the rfc.