Frequent DNS timeouts

SteveITS

known issue for years that registering dhcp restarts unbound

ref: https://redmine.pfsense.org/issues/5413

thundergate

@johnpoz Oh no - That's stupid?!

But I do need those DHCP leases to be seen to know what device does make all those requests.... Cannot stand with IP addresses only.

I used OPNsense before and didn't have those issues, if I remember correctly?

SteveITS

@thundergate Then until resolved, as I noted above make your lease time longer. It will restart on average every ( (lease duration/2) / # leases ).
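
To put numbers on that rule of thumb, here is a minimal Python sketch. It assumes, as the formula does, that every client renews at half its lease time and that each renewal triggers a restart. The function name and the sample figures are only illustrative.

```python
def avg_seconds_between_restarts(lease_seconds: float, num_leases: int) -> float:
    """Rule of thumb from the post: (lease duration / 2) / number of leases."""
    if num_leases <= 0:
        raise ValueError("need at least one lease")
    return (lease_seconds / 2) / num_leases


# Hypothetical network with 30 dynamic clients
print(avg_seconds_between_restarts(2 * 3600, 30))       # 2-hour leases -> 120.0
print(avg_seconds_between_restarts(4 * 24 * 3600, 30))  # 4-day leases  -> 5760.0
```

So with 30 clients, going from 2-hour to 4-day leases would move the average from one restart every 2 minutes to one every 96 minutes.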

thundergate

@steveits Will have to look into it.

I'm quite disappointed. I never thought such an error would exist in pfSense (and it has existed for a few years now).

Are you all not interested in name resolution and do only handle IPs?

For me unbound restarts every 2-5 minutes (doesn't look like it is the DHCP lease issue at all?!).

johnpoz

@thundergate said in Frequent DNS timeouts:

Are you all not interested in name resolution and do only handle IPs?

All of the devices I have that I need to resolve or want to resolve via name I have reserved IP for ;)

johnpoz

@thundergate said in Frequent DNS timeouts:

DHCP lease issue at all?!).

well look in your dhcp log - does it match up or not.. My leases are 4 days long.. But 2 hour lease, with lots of devices yeah you could have a few an hour for sure..

SteveITS

@thundergate said in Frequent DNS timeouts:

not interested in name resolution and do only handle IPs?

Depends on the setup. Clients with Windows domains use Windows DNS so it's handled. Windows in general/SMB will discover an address by NetBIOS name anyway. Printers get static or reservations. So in most cases it isn't really needed.

thundergate

@steveits Main point where I do need it is within my pfBlockerNG logs to see what device is doing which requests and so on...

Gertjan

@thundergate said in Frequent DNS timeouts:

But I do need those DHCP leases to be seen to know what device does make all those requests....

DHCP will still work.
The only thing that doesn't happen anymore is that their mostly stupid host names, like AZERFDGHH (and you know what that is: the doorbell), the 6 devices that are all called 'Android', the four 'iPad's and whatever else, will pollute your DHCP leases list.
If you really want control, and want to even start to pretend that you know which device belongs to whom ("who is what and when" on your network), then give them logical names, names you choose, and not the built-in device name.
And, oh, before I forget: a lot of devices don't even give a host name to enter into the DNS, but the resolver will get restarted anyway...
That issue is also solved... by the pfSense admin, you, of course.

So : take a first step : list all the MAC addresses, and give them all names that you understand.
At the end, pfSense will contain a list of all your devices, with names that you can easily remember.
On the device side you have nothing to do for any device, as most use DHCP out of the box.
The day you find a device using an IP out of the DHCP server pool, you know on the spot that you have a new device on your network.

Static DHCP leases are read into the resolver (unbound) at startup and will not change, except on the day you add a new device to your network and create the "MAC IP host name" entry for it.
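
The "new device" check can be sketched in a few lines of Python. This only illustrates the idea and is not something pfSense ships: the MAC addresses, IPs and the function name are made up, and you would have to fill both lists from your own static mappings and lease table.

```python
def find_unknown_devices(leases, known_macs):
    """Return (mac, ip) pairs for leases whose MAC has no static mapping."""
    known = {mac.lower() for mac in known_macs}
    return [(mac, ip) for mac, ip in leases if mac.lower() not in known]


# Hypothetical static mappings (MAC -> name you chose)
known_macs = {"aa:bb:cc:dd:ee:ff": "nas"}
# Hypothetical active leases (MAC, IP)
leases = [
    ("aa:bb:cc:dd:ee:ff", "192.168.1.10"),
    ("11:22:33:44:55:66", "192.168.1.150"),
]

print(find_unknown_devices(leases, known_macs))
# [('11:22:33:44:55:66', '192.168.1.150')]
```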

[23.01-RELEASE][admin@pfSense.what-a-mess.tld]/root: unbound-control -c /var/unbound/unbound.conf status
version: 1.17.1
verbosity: 1
threads: 2
modules: 3 [ python validator iterator ]
uptime: 111205 seconds
options: control(ssl)
unbound (pid 24788) is running...

That's a bit more than a day for me (111205 seconds is about 31 hours), since I was testing the UPS shutdown procedure.
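
A small Python sketch for turning that uptime figure into something readable. It only parses text shaped like the status output above. The function name is an assumption, and you would feed it whatever `unbound-control status` prints on your box.

```python
import re
from datetime import timedelta


def parse_uptime(status_text: str) -> timedelta:
    """Extract the 'uptime: N seconds' line from unbound-control status output."""
    match = re.search(r"^uptime:\s*(\d+)\s+seconds", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no uptime line found")
    return timedelta(seconds=int(match.group(1)))


sample = """version: 1.17.1
verbosity: 1
threads: 2
uptime: 111205 seconds
unbound (pid 24788) is running..."""

print(parse_uptime(sample))  # 1 day, 6:53:25
```

An uptime that never gets past a few minutes is the quickest sign that something keeps restarting the resolver.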

@thundergate said in Frequent DNS timeouts:

For me unbound restarts every 2-5 minutes (doesn't look like it is the DHCP lease issue at all?!).

Actually, I hope for you that this is your issue.
If it's not, here are the other X reasons why unbound gets restarted:
Your WAN, LAN or any other interface is bad and goes up and down all the time.
This will restart unbound, and many other processes as well.
Every x minutes...
Not a good thing.
Or unbound is plain 'bad': less plausible, as you and I use the same code, and days without a restart are possible.
Another reason: it has been seen that people wanted to update their pfBlockerNG feeds every hour or so. If any of these lists actually changed => unbound gets restarted.

And then this example: remember that stupid doorbell mentioned above? It was too cheap, it had a broken DHCP client, and it was asking for a new lease every minute... The pfSense admin was posting here, as he had checked "DHCP registration" and did not look into the logs to see that that doorbell was asking a new lease every xx seconds.

Also: people don't feel or notice radio waves. Devices do, as they need them for the wifi connection. When a device is at the edge of reachability, the link goes up and down every x seconds. On every link-up, a DHCP request is fired. Your phone has now become a pfSense unbound killer.

Now you know why the DHCP registration is, by default, not checked.
Now you know (parts of) what needs to be checked once in a while, before you check it.

I was hoping for a more permanent solution, years ago.
I'm not waiting any more: I solved the issue for me, on my side. And DNS rocks, for me.

johnpoz

@gertjan said in Frequent DNS timeouts:

did not look into the logs to see that that doorbell was asking a new lease every xx seconds.

I have seen this - client just asks and asks and asks.. Even when they just got a lease good for hours or even days.

thundergate

Thx for all the feedback.

I turned off 'register DHCP leases' and will now start adding them myself. As far as I understand it's a 'one time job' and then the client does have a static lease/IP and that's it?!

johnpoz

@thundergate yup set it and shouldn't have to touch it again unless you want that device to have a different IP, or you want to hand out something specific to that device different than your normal scope etc..

Look at this POS device

Mar 16 01:38:52 	dhcpd 	93450 	DHCPACK on 192.168.2.203 to 88:b2:91:98:d6:f0 via igb2
Mar 16 01:38:52 	dhcpd 	93450 	DHCPREQUEST for 192.168.2.203 from 88:b2:91:98:d6:f0 via igb2
Mar 16 01:37:41 	dhcpd 	93450 	DHCPACK on 192.168.2.203 to 88:b2:91:98:d6:f0 via igb2
Mar 16 01:37:41 	dhcpd 	93450 	DHCPREQUEST for 192.168.2.203 from 88:b2:91:98:d6:f0 via igb2
Mar 16 01:31:44 	dhcpd 	93450 	DHCPACK on 192.168.2.203 to 88:b2:91:98:d6:f0 via igb2
Mar 16 01:31:44 	dhcpd 	93450 	DHCPREQUEST for 192.168.2.203 from 88:b2:91:98:d6:f0 via igb2
Mar 16 01:30:01 	dhcpd 	93450 	DHCPACK on 192.168.2.203 to 88:b2:91:98:d6:f0 via igb2
Mar 16 01:30:01 	dhcpd 	93450 	DHCPREQUEST for 192.168.2.203 from 88:b2:91:98:d6:f0 via igb2
Mar 16 01:29:20 	dhcpd 	93450 	DHCPACK on 192.168.2.203 to 88:b2:91:98:d6:f0 via igb2
Mar 16 01:29:20 	dhcpd 	93450 	DHCPREQUEST for 192.168.2.203 from 88:b2:91:98:d6:f0 via igb2
Mar 16 01:19:00 	dhcpd 	93450 	DHCPACK on 192.168.2.203 to 88:b2:91:98:d6:f0 via igb2
Mar 16 01:19:00 	dhcpd 	93450 	DHCPREQUEST for 192.168.2.203 from 88:b2:91:98:d6:f0 via igb2
Mar 16 01:18:23 	dhcpd 	93450 	DHCPACK on 192.168.2.203 to 88:b2:91:98:d6:f0 via igb2
Mar 16 01:18:23 	dhcpd 	93450 	DHCPREQUEST for 192.168.2.203 from 88:b2:91:98:d6:f0 via igb2
Mar 16 01:17:49 	dhcpd 	93450 	DHCPACK on 192.168.2.203 to 88:b2:91:98:d6:f0 via igb2
Mar 16 01:17:49 	dhcpd 	93450 	DHCPREQUEST for 192.168.2.203 from 88:b2:91:98:d6:f0 via igb2
Mar 16 01:13:43 	dhcpd 	93450 	DHCPACK on 192.168.2.203 to 88:b2:91:98:d6:f0 via igb2
Mar 16 01:13:43 	dhcpd 	93450 	DHCPREQUEST for 192.168.2.203 from 88:b2:91:98:d6:f0 via igb2
Mar 16 01:11:04 	dhcpd 	93450 	DHCPACK on 192.168.2.203 to 88:b2:91:98:d6:f0 via igb2
Mar 16 01:11:04 	dhcpd 	93450 	DHCPREQUEST for 192.168.2.203 from 88:b2:91:98:d6:f0 via igb2

Thats my wife's shitty iphone, charging..

You have some device doing that - going to cause unbound to go crazy restarting like that..
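
If you want to find such a client without scrolling through the log, something like this Python sketch would count DHCPREQUEST lines per MAC. It assumes the dhcpd log format shown above. The function name is made up, and the sample lines are copied from that log with the tabs turned into spaces.

```python
import re
from collections import Counter

# Matches e.g. "DHCPREQUEST for 192.168.2.203 from 88:b2:91:98:d6:f0 via igb2"
REQUEST_RE = re.compile(r"DHCPREQUEST for (\S+) from ([0-9a-fA-F:]{17})")


def count_requests(log_lines):
    """Count DHCPREQUEST lines per client MAC address."""
    counts = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match:
            counts[match.group(2).lower()] += 1
    return counts


sample = [
    "Mar 16 01:38:52 dhcpd 93450 DHCPACK on 192.168.2.203 to 88:b2:91:98:d6:f0 via igb2",
    "Mar 16 01:38:52 dhcpd 93450 DHCPREQUEST for 192.168.2.203 from 88:b2:91:98:d6:f0 via igb2",
    "Mar 16 01:37:41 dhcpd 93450 DHCPREQUEST for 192.168.2.203 from 88:b2:91:98:d6:f0 via igb2",
]

for mac, n in count_requests(sample).most_common():
    print(mac, n)  # 88:b2:91:98:d6:f0 2
```

A client that shows up dozens of times an hour is the one to look at first.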

A reservation doesn't stop them from asking.. But you can now resolve it, and not have to worry about registering dhcp dynamic clients in unbound.

I really should prob get the wifi just to turn off her wifi when she is charging it... I looked and it did it last night as well..

edit: looking at my wifi log, her phone is roaming between 2 different APs, it keeps flipping back and forth - this is what is most likely causing the dhcp - maybe I can get her to move where she is charging it but the rssi shouldn't be switching between ap like that..

Gertjan

@johnpoz said in Frequent DNS timeouts:

....
edit: looking at my wifi log, her phone is roaming between 2 different APs, it keeps flipping back and forth - this is what is most likely causing the dhcp - maybe I can get her to move where she is charging it but the rssi shouldn't be switching between ap like that..

Bigger issues are on the horizon.
iPhones 'decide' to back up their content "when they are charging, have wifi, feel happy, and who knows what other criteria have to be met". That is, when you have the 1 $/€ monthly Apple backup plan, which permits you to restore onto a new iPhone with one click: no messages, photos (!), apps or settings lost if something happens to the current one.
Believe me, this 1$ solution is way better than what a lawyer will ask you ;)

The wifi hopping: true, too much wifi is killing the wifi.
She could disable the "auto connect" on all overlapping home wifi SSID's except for one and your DHCP issue will be solved.

Btw : here, where I work, I've 4 AP's using the same SSID, as its the wifi access with a captive portal for our hotel. I see this hopping a lot, as people tend to move in the building.
Our captive portal has its own network and its own DHCP server.
And I don't want to see what their names are, as, for me, it's an untrusted network.

@thundergate said in Frequent DNS timeouts:

As far as I understand it's a 'one time job' and then the client does have a static lease/IP and that's it?!

Exactly.
The device will say: "hey, I'm aa:bb:cc:dd:ee:ff, do you have an IP for me?" and pfSense will hand over the IP you've selected for it, and not an IP from the DHCP pool.
Most devices will even ask for that same IP in the future.
Nice side effect: you will know from now on that your NAS has 192.168.1.10.
And unbound doesn't get restarted.

johnpoz

@gertjan said in Frequent DNS timeouts:

on all overlapping home wifi SSID's except for one and your DHCP issue will be solved.

She is not jumping ssids.. she is moving from 1 AP to another one.. From looking she is right at the cusp of the min rssi I had set.. Tomorrow I will put the developer tools on her phone so I can see what she is seeing for the signal strength.. But I bumped the min rssi a few dBm and it seems to have settled in to 1 AP now.

And it settles down after a bit.. I am not having any issues, I just noticed my wife's phone doing that and thought it was a perfect example of what could cause unbound to restart if you're registering dhcp, which I am not.. And her phone has a reservation..

Gertjan

@johnpoz said in Frequent DNS timeouts:

She is not jumping ssids.

I understood that. The device is hopping around as the current SSID becomes worse than the surrounding available SSIDs, already known, so it hops over.
And the process repeats.

johnpoz

@gertjan it stopped doing it once I changed the min rssi from -67 to -73

JonH

@gertjan said in Frequent DNS timeouts:

So : take a first step : list all the MAC addresses, and give them all names that you understand.

I'm missing something here.

I have the same issues as OP & @thundergate. I'd call it 'hangs', not timeouts.
I've read & read thread after thread. I don't have dnssec set. I don't have dhcp leases registered. My resolver simply hangs for around 5-6 minutes and then starts working.

My dns is Lo, 9.9.9.9, 149.112.112.112
I've tried switching it to 8.8.8.8 and 1.1.1.1, and openDNS. None of them work any better.

My current solution is a cron job in pfSense to restart unbound every 30 min. This is NOT what I want to do, it is a convenience for me. However, it still will hang once in a great while in between my 30 min restarts.

I want to stress that my problem is a hang, not a restart. The upthread suggestion to use grep to search for restart isn't worthwhile in a log that rolls over in an hour or two. I've checked in the past and don't recall a restart in the log.
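
One way to catch a hang without relying on the rolling log is to probe the resolver from a LAN client and note when it stops answering. Below is a minimal Python sketch using only the standard library. It hand-builds a plain UDP query for an A record, so it says nothing about DoT. The function names, the 2-second timeout and the resolver address in the usage note are assumptions.

```python
import socket
import struct
import time


def build_query(name, qtype=1, query_id=0x1234):
    """Build a minimal DNS query packet (qtype 1 = A, class IN, recursion desired)."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)


def probe(server, name, timeout=2.0):
    """Return (rcode, seconds); rcode is None when no answer arrived in time."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        try:
            sock.sendto(build_query(name), (server, 53))
            data, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and network errors
            return None, time.monotonic() - start
        # The low 4 bits of the fourth header byte hold the response code
        return data[3] & 0x0F, time.monotonic() - start
```

Calling `probe("192.168.1.1", "example.com")` once a minute and logging the result with a timestamp would show whether the resolver goes silent (rcode `None`) or answers instantly with rcode 2 (SERVFAIL).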

I've tried jacking up the log level in Unbound but that has not given me any hints because I'm not savvy in reading all this stuff even with google helping with some of it. If I go up above level 3 it stops logging (I guess it needs to start in a different mode?). What I see is it starts getting a lot of servfail (not sure, but believe this is logged for dnsbl entries also).

This 'problem' with unbound seems to be experienced by very few users on this forum, but it is experienced and so far I have not seen a solution for a user who is correctly using the resolver & tls and is not using dnssec. They are also using python mode and pfBlockerNG. I did see one reference that pfBlocker had a patch but the package mgr still has the same version I'm using so I don't know any more about that and if it is for this issue.

I would like a better explanation of what @Gertjan says in the quote at the head of this reply, because I don't know, or misunderstand, what he is saying and how to accomplish it. Is this referring to, on Apple, "clientID"? My arp table has a few "host names" (mostly IoT devices) and I do not see a way to get apple devices showing this info (probably because Apple goes to great effort to hide itself). Or is it simply a list of MAC addresses that can be referred to in order to ID entries?

Before going to v23.01 and switching to the latest pfBlockerNG I did not have these issues.

SteveITS

@jonh I think the context of Gertjan's message was to help that poster to assign static IPs, in order to avoid DHCP registrations, which does restart Unbound.

There are many threads, as you said. In another thread someone suggested/speculated Quad9 was rate limiting or just not answering if there were a lot of DoT requests. At home I have had DoT with Quad9 on 23.01 enabled for a few weeks now, but volume is not high. I haven't tried turning DoT on at other places, but we've had no reports of DNS issues with it off. I did have to turn off DNSSEC on 23.01, which Quad9 themselves say will cause problems when forwarding. Others have posted disabling DNSSEC didn't help but disabling DoT did.

JonH

@steveits said in Frequent DNS timeouts:

At home I have had DoT with Quad9 on 23.01 enabled for a few weeks now, but volume is not high.

That's interesting. My DoT subnet has only 3 states set with 12MiB with an uptime of < 2 days.
I'm not sure that is a good indicator but it does seem maybe it is high. But again, this was not happening prior to my upgrade. As for Quad9 rate limiting, if that is the reason then it implies the other 3 DNS/TLS servers I tried also may be rate limiting. I have not seen that in my logs when I was looking for a common cause for this problem.

I want to be able to access my IoT via Apple's HomeKit while away from home.

Here is a snippet from resolver.log for a typical 'hang'

Mar 17 11:06:59 pfSense unbound[77573]: [77573:1] info: 192.168.10.143 p39-imap.mail.me.com.akadns.net. AAAA IN SERVFAIL 0.000000 0 49
Mar 17 11:06:59 pfSense unbound[77573]: [77573:1] info: resolving p39-imap.mail.me.com.akadns.net. A IN
Mar 17 11:06:59 pfSense unbound[77573]: [77573:1] info: 192.168.10.143 p39-imap.mail.me.com.akadns.net. A IN SERVFAIL 0.000000 0 49
Mar 17 11:06:59 pfSense unbound[77573]: [77573:1] info: resolving jimap.imap.mail.yahoo.com. AAAA IN
Mar 17 11:06:59 pfSense unbound[77573]: [77573:1] info: 192.168.10.143 jimap.imap.mail.yahoo.com. AAAA IN SERVFAIL 0.000000 0 43
Mar 17 11:06:59 pfSense unbound[77573]: [77573:1] info: resolving jimap.imap.mail.yahoo.com. A IN
Mar 17 11:06:59 pfSense unbound[77573]: [77573:1] info: 192.168.10.143 jimap.imap.mail.yahoo.com. A IN SERVFAIL 0.000000 0 43
Mar 17 11:06:59 pfSense unbound[77573]: [77573:1] info: resolving www.chevybolt.org. HTTPS IN
Mar 17 11:06:59 pfSense unbound[77573]: [77573:1] info: 192.168.10.7 www.chevybolt.org. HTTPS IN SERVFAIL 0.000000 0 35
Mar 17 11:06:59 pfSense unbound[77573]: [77573:1] info: resolving www.chevybolt.org. A IN
Mar 17 11:06:59 pfSense unbound[77573]: [77573:1] info: 192.168.10.7 www.chevybolt.org. A IN SERVFAIL 0.000000 0 35
Mar 17 11:06:59 pfSense unbound[77573]: [77573:1] info: resolving config.htplayground.com. HTTPS IN
Mar 17 11:06:59 pfSense unbound[77573]: [77573:1] info: 192.168.10.7 config.htplayground.com. HTTPS IN SERVFAIL 0.000000 0 41
Mar 17 11:06:59 pfSense unbound[77573]: [77573:1] info: resolving config.htplayground.com. A IN
Mar 17 11:06:59 pfSense unbound[77573]: [77573:1] info: 192.168.10.7 config.htplayground.com. A IN SERVFAIL 0.000000 0 41
Mar 17 11:06:59 pfSense unbound[77573]: [77573:1] info: 192.168.10.143 p39-imap.mail.me.com.akadns.net. AAAA IN SERVFAIL 0.000000 1 49
Mar 17 11:06:59 pfSense unbound[77573]: [77573:1] info: 192.168.10.143 jimap.imap.mail.yahoo.com. AAAA IN SERVFAIL 0.000000 1 43
Mar 17 11:06:59 pfSense unbound[77573]: [77573:1] info: 192.168.10.143 p39-imap.mail.me.com.akadns.net. A IN SERVFAIL 0.000000 1 49
Mar 17 11:06:59 pfSense unbound[77573]: [77573:1] info: 192.168.10.143 jimap.imap.mail.yahoo.com. A IN SERVFAIL 0.000000 1 43
Mar 17 11:06:59 pfSense unbound[77573]: [77573:1] info: 192.168.10.7 www.chevybolt.org. HTTPS IN SERVFAIL 0.000000 1 35
Mar 17 11:06:59 pfSense unbound[77573]: [77573:1] info: 192.168.10.7 www.chevybolt.org. A IN SERVFAIL 0.000000 1 35
Mar 17 11:06:59 pfSense unbound[77573]: [77573:1] info: 192.168.10.7 config.htplayground.com. HTTPS IN SERVFAIL 0.000000 1 41

This particular hang lasted 2.5 minutes before I manually restarted unbound.
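
For anyone who wants to summarize a stretch of log like this, here is a rough Python sketch that tallies SERVFAIL replies per client. It assumes the reply lines look exactly like the ones above. The field names in the regex (resolve time, cached flag, size) are my reading of the last three columns and should be treated as an assumption.

```python
import re
from collections import Counter

# e.g. "info: 192.168.10.7 www.chevybolt.org. A IN SERVFAIL 0.000000 0 35"
REPLY_RE = re.compile(
    r"info: (?P<client>\d+\.\d+\.\d+\.\d+) (?P<name>\S+) (?P<qtype>\S+) IN "
    r"(?P<rcode>[A-Z]+) (?P<secs>[\d.]+) (?P<cached>[01]) (?P<size>\d+)"
)


def servfail_by_client(log_lines):
    """Count SERVFAIL reply lines per client IP."""
    counts = Counter()
    for line in log_lines:
        match = REPLY_RE.search(line)
        if match and match.group("rcode") == "SERVFAIL":
            counts[match.group("client")] += 1
    return counts


sample = [
    "Mar 17 11:06:59 pfSense unbound[77573]: [77573:1] info: 192.168.10.143 p39-imap.mail.me.com.akadns.net. AAAA IN SERVFAIL 0.000000 0 49",
    "Mar 17 11:06:59 pfSense unbound[77573]: [77573:1] info: resolving p39-imap.mail.me.com.akadns.net. A IN",
    "Mar 17 11:06:59 pfSense unbound[77573]: [77573:1] info: 192.168.10.143 p39-imap.mail.me.com.akadns.net. A IN SERVFAIL 0.000000 0 49",
    "Mar 17 11:06:59 pfSense unbound[77573]: [77573:1] info: 192.168.10.7 www.chevybolt.org. HTTPS IN SERVFAIL 0.000000 0 35",
]

for client, n in servfail_by_client(sample).most_common():
    print(client, n)
```

Every client failing at the same second points at the resolver or its upstream rather than at one misbehaving device.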

Here is the final sequence during a restart of unbound:

Mar 17 11:07:21 pfSense unbound[77573]: [77573:0] info: [pfBlockerNG]: pfb_unbound.py script exiting
Mar 17 11:08:12 pfSense unbound[35852]: [35852:0] notice: init module 0: python
Mar 17 11:08:12 pfSense unbound[35852]: [35852:0] info: [pfBlockerNG]: pfb_unbound.py script loaded
Mar 17 11:08:14 pfSense unbound[35852]: [35852:0] info: [pfBlockerNG]: init_standard script loaded
Mar 17 11:08:14 pfSense unbound[35852]: [35852:0] notice: init module 1: iterator
Mar 17 11:08:14 pfSense unbound[35852]: [35852:0] info: start of service (unbound 1.17.1).
Mar 17 11:08:14 pfSense unbound[35852]: [35852:0] info: resolving ocsp2.apple.com. HTTPS IN
Mar 17 11:08:14 pfSense unbound[35852]: [35852:0] info: resolving ocsp2.apple.com. A IN
Mar 17 11:08:14 pfSense unbound[35852]: [35852:0] info: resolving amp-api-edge.apps.apple.com. HTTPS IN
Mar 17 11:08:14 pfSense unbound[35852]: [35852:0] info: resolving ocsp2.apple.com. AAAA IN
Mar 17 11:08:14 pfSense unbound[35852]: [35852:0] info: resolving amp-api-edge.apps.apple.com. AAAA IN
Mar 17 11:08:14 pfSense unbound[35852]: [35852:0] info: resolving amp-api-edge.apps.apple.com. A IN
Mar 17 11:08:14 pfSense unbound[35852]: [35852:0] info: response for amp-api-edge.apps.apple.com. A IN
Mar 17 11:08:14 pfSense unbound[35852]: [35852:0] info: reply from <.> 149.112.112.112#853
Mar 17 11:08:14 pfSense unbound[35852]: [35852:0] info: query response was CNAME
Mar 17 11:08:14 pfSense unbound[35852]: [35852:0] info: resolving amp-api-edge.apps.apple.com. A IN
Mar 17 11:08:14 pfSense unbound[35852]: [35852:0] info: response for ocsp2.apple.com. AAAA IN
Mar 17 11:08:14 pfSense unbound[35852]: [35852:0] info: reply from <.> 9.9.9.9#853
Mar 17 11:08:14 pfSense unbound[35852]: [35852:0] info: query response was CNAME
Mar 17 11:08:14 pfSense unbound[35852]: [35852:0] info: resolving ocsp2.apple.com. AAAA IN
Mar 17 11:08:14 pfSense unbound[35852]: [35852:0] info: response for ocsp2.apple.com. HTTPS IN

And after the restart I'm back up and running.
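
To see how long the resolver was actually down during a restart, the two marker lines in that sequence can be paired up. A minimal Python sketch follows. It assumes the pfBlockerNG "script exiting" line marks the stop and "start of service" marks the start, as in the log above. The year is assumed too, since syslog timestamps don't carry one.

```python
from datetime import datetime


def restart_gaps(log_lines, year=2023):
    """Yield (stopped_at, started_at, seconds) for each stop/start pair."""
    stopped_at = None
    for line in log_lines:
        try:
            # First 15 characters are the syslog timestamp, e.g. "Mar 17 11:07:21"
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # skip lines without a parsable timestamp
        if "script exiting" in line:
            stopped_at = stamp
        elif "start of service" in line and stopped_at is not None:
            yield stopped_at, stamp, (stamp - stopped_at).total_seconds()
            stopped_at = None


sample = [
    "Mar 17 11:07:21 pfSense unbound[77573]: [77573:0] info: [pfBlockerNG]: pfb_unbound.py script exiting",
    "Mar 17 11:08:12 pfSense unbound[35852]: [35852:0] notice: init module 0: python",
    "Mar 17 11:08:14 pfSense unbound[35852]: [35852:0] info: start of service (unbound 1.17.1).",
]

for stopped, started, seconds in restart_gaps(sample):
    print(stopped.time(), started.time(), seconds)  # 11:07:21 11:08:14 53.0
```

In this case the restart itself cost 53 seconds without DNS, which is worth keeping in mind for anything that restarts unbound on a schedule.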

SteveITS

@jonh said in Frequent DNS timeouts:

As for Quad9 rate limiting, if that is the reason then it implies the other 3 DNS/TLS servers I tried also may be rate limiting

It occurs to me that if (if) it is rate limiting on the remote end, restarting Unbound probably wouldn't fix it. However if connections are being held open for some reason (e.g. rate limiting? bug?) and Unbound stops connecting out, or gets connection refusals, that could explain both the "self recover" and "restart to recover" behavior...?