pfSense resolver stops working
-
@johnpoz thank you for your reply once more.
Unfortunately, as mentioned, the instance where the issue is happening does not have pfblockerng (I do on my setup at home, with lots of lists, so I know what you're talking about; that's just not the case on this instance).
All things considered, I'd say this instance has too little load to provoke that.
Load average is 0.79, 0.68, 0.58
CPU is almost always below 4%.
Also, you said this, which got me thinking:
Not saying turning off register dhcp is the correct solution, but it is a workaround for some users who have issues with unbound constantly restarting to the point that they notice issues with their dns.
But when I am trying to resolve and it's broken, it doesn't matter if it is 1 second, 2 seconds, 60 seconds or 10 minutes. Unbound only recovers after the service being manually restarted. And when I open the pfSense, it does not show the service as stopped, but rather running.
What I got from your explanation was that in those cases, due to large lists and many potential entries to be processed, that unbound would take a little longer to start. Ok, so after a bit it would work I guess?
One final question,
DNS Resolver has these options:
DHCP Registration - Register DHCP leases in the DNS Resolver
Static DHCP - Register DHCP static mappings in the DNS Resolver
When I go to the DHCP leases page, all the entries are static, some online, others offline.
Since they are all static, none dynamic, and already in the table: let's say a DHCP lease, despite being static, expires (they do, because they have a timer; they just get the same lease again) and so it renews. Does this make unbound restart?
-
@maverickws said in pfSense resolver stops working:
Unbound only recovers after the service being manually restarted
This shouldn't happen, and something else is going on.
No, static leases are only loaded when unbound starts; there would never be a restart for those. It loads those entries - doesn't care if the client is online or not, or if it renews, etc.
My unbound never restarts, unless I restart it for something.. Last time I looked, it had been up for 10+ days.. And now it's been up for 6-some days... That was the last time I restarted it, doing something for a thread here.
[22.05-RELEASE][admin@sg4860.local.lan]/root: unbound-control -c /var/unbound/unbound.conf status
version: 1.15.0
verbosity: 1
threads: 4
modules: 2 [ validator iterator ]
uptime: 530660 seconds
options: control(ssl)
unbound (pid 83047) is running...
[22.05-RELEASE][admin@sg4860.local.lan]/root:
530k seconds = 6+ days.
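As a quick check on that arithmetic, here is a minimal Python sketch that pulls the `uptime:` field out of pasted `unbound-control status` output and converts it to days and hours. The function name `uptime_from_status` is made up for illustration; the sample text is the status output shown above.

```python
import re
from datetime import timedelta


def uptime_from_status(status_text: str) -> timedelta:
    """Extract the 'uptime: N seconds' field from unbound-control status output."""
    match = re.search(r"uptime:\s*(\d+)\s*seconds", status_text)
    if match is None:
        raise ValueError("no uptime field found in status output")
    return timedelta(seconds=int(match.group(1)))


# Status output as pasted in the post above.
sample = """version: 1.15.0
verbosity: 1
threads: 4
modules: 2 [ validator iterator ]
uptime: 530660 seconds
options: control(ssl)
unbound (pid 83047) is running..."""

up = uptime_from_status(sample)
# timedelta splits the total into whole days plus leftover seconds.
print(f"{up.days} days, {up.seconds // 3600} hours")  # 6 days, 3 hours
```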
-
@johnpoz Ok amazing thanks for clearing that out!
So the situation is that:
- It's a base install, with no extra packages other than Service Watchdog;
- Low load;
- Unbound stops responding, but shows running;
- The unbound resolver never recovers on its own, no matter how long it's been;
- All the clients have static leases.
From here where can I head to find what's going on?
-
@maverickws if you're saying it's running, but not responding..
So when you query directly, you get a timeout, NX, refused? Which interfaces do you have unbound using - did you have an interface go down, like wan or vpn, or whatever?
It resolves nothing, not even pfsense's own name? Or local - or does it just not resolve public stuff like google.com?
For example, I just sent an empty query to unbound from my pc, and I get back the roots.. You're saying this fails? With what error, timeout?
$ dig @192.168.9.253
; <<>> DiG 9.16.30 <<>> @192.168.9.253
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 9582
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 13, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;. IN NS
;; ANSWER SECTION:
. 74099 IN NS a.root-servers.net.
. 74099 IN NS b.root-servers.net.
. 74099 IN NS c.root-servers.net.
. 74099 IN NS d.root-servers.net.
. 74099 IN NS e.root-servers.net.
. 74099 IN NS f.root-servers.net.
. 74099 IN NS g.root-servers.net.
. 74099 IN NS h.root-servers.net.
. 74099 IN NS i.root-servers.net.
. 74099 IN NS j.root-servers.net.
. 74099 IN NS k.root-servers.net.
. 74099 IN NS l.root-servers.net.
. 74099 IN NS m.root-servers.net.
;; Query time: 0 msec
;; SERVER: 192.168.9.253#53(192.168.9.253)
;; WHEN: Tue Jul 26 09:27:22 Central Daylight Time 2022
;; MSG SIZE rcvd: 239
If I do empty nslookup I get back pfsense name
$ nslookup
Default Server: sg4860.local.lan
Address: 192.168.9.253
-
I actually tested directly with dig but I don't remember exactly what the error in the answer was. I did that a couple of days ago and the terminal output is long gone.
I'll wait until the issue occurs again and check it out, and also note whether there were interface changes.
Regarding which interfaces unbound is using: All.
I am not entirely sure about if it fails only on public stuff or local as well. I'll update with that detailed info as soon as possible!
edit:
If I do an empty nslookup it only returns an interactive prompt. This is the default behaviour, isn't it? Got a bit confused now.
If I add something it will return in the top section that the server is the pfsense IP (gets it from resolv.conf) and the non-authoritative answer to the query.
-
@maverickws might depend on the os you're running nslookup on. In windows it returns the IP and name of the default ns it's going to ask. Or the one you set it to..
On linux yeah it just returns the prompt > until you lookup something.
to your ALL setting, you might want to change to only what you want to listen on or use.
I have mine to only listen on some of my interfaces, and for outbound I only have it using localhost.. Which will then be natted out my wan connection.
-
ah ok got it, I tested on my machine its macOS and our jump box is red hat 8 so I was wondering.
About the network interfaces bound to resolver, I have left all because ... so, let me do a train of thought:
We intend it to answer on all interfaces - all local + WAN - since we use resolver access lists and some external devices or segments make queries to the WAN (actually to the WAN CARP VIP, so here WAN would be the VIP from the list).
Beside the usual interfaces, we have:
One transparent bridge on a dedicated interface, and like 12 VIPs.
Would you advise specifically selecting only the interfaces like LAN/DMZ etc, and the corresponding IPv6 Link-Locals, with replies only over localhost?
Regarding the Outgoing Network Interfaces: with outgoing set to localhost, if the query is done from the LAN, will the reply be NATed through the outgoing VIP in this case? Or would localhost figure the LAN interface is more appropriate for this reply?
-
@maverickws if your outgoing is localhost it should go out whatever your default is.. Per your routing.. Be that natted to some public, or via some internal interface IP.
-
@johnpoz alright, have to say I never paid much attention to these interface options. I noticed that despite listening on all interfaces, on the external side only devices from the access list would actually get a reply, so I never changed that...
I'm going to change that config now.
On the listening interfaces I'll leave:
LAN IPv6 Link-Local
PUB IPv6 Link-Local
CARP LAN VIP
CARP PUB VIP
CARP WAN VIP
DMZ VIP
And on the outgoing interfaces that'll be only "localhost".
(if localhost has to choose between the interface IP, or the VIP which is the gateway for the interface, which will it prefer? VIP or IP?)
EDIT:
I updated the settings selecting the mentioned interfaces, when I hit save I get an error:
The following input errors were detected: This system is configured to use the DNS Resolver as its DNS server, so Localhost or All must be selected in Network Interfaces.
So I've also selected localhost on the listening interfaces. You also got that on your selection so my bad!
-
@maverickws yeah you have to listen on localhost if you want pfsense to be able to use 127.0.0.1, which is the default ;)
-
@johnpoz
yeah it makes sense!!
Ok so anyway I've changed these configs as reported on the thread here. I did not disable the DHCP Leases option yet as I'm waiting to see if it fails again so I can do some more checks in regard to those points you mentioned above.
Will stay on top of this and if the issue keeps occurring or when it occurs I'll get back!
thanks a lot for all the feedback so far! have a great one
-
In about 3 months I had the DNS Resolver stop working 3 times. Definitely an issue!
Very basic setup. pfSense 2.6.0 on metal.
I do not have "Register DHCP leases in the DNS Resolver" checked. Service restart fixes it until the next time.
Options that are "on":
- Respond to incoming SSL/TLS queries from local clients
- Enable DNSSEC Support
- Enable Forwarding Mode
- Use SSL/TLS for outgoing DNS Queries to Forwarding Servers
-
@ik13 said in pfSense resolver stops working:
Enable DNSSEC Support
Enable Forwarding Mode
This is by no means a good setup - have been over it and over it here multiple times.. If you're going to forward there is zero reason to have dnssec checked; where you forward to either does it or it doesn't. Having that checked does nothing but cause extra queries and problems.
-
@maverickws said in pfSense resolver stops working:
I did not disable the DHCP Leases option yet as I'm waiting to see if it fails again
Just keep in mind what will happen with this option checked :
Every time a device on any of your LANs requests or renews a lease, unbound will get restarted.
Even if you have just one LAN and a couple of devices, this will happen halfway through the duration of every lease, or when a device gets disconnected and reconnects (think of phones and other wifi devices).
Run this on the console to see unbound stopping :
grep -E -i '(start|stopped)' /var/log/resolver.log
Again, it's not only a DHCP event that restarts unbound, it can also be pfblockerng-devel, or an interface down+up event.
These unbound stop+starts are not bad, but, if they take some time, they will leave your network without DNS for a couple of moments.
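To put rough numbers on how often that could be, here is a back-of-the-envelope Python sketch. It assumes, as described above, that every DHCP client renews at half its lease time and that each renewal triggers one unbound restart. The device count and the 7200 second lease time are made-up example inputs, not values from this thread, and `restarts_per_day` is just an illustrative name.

```python
def restarts_per_day(devices: int, lease_seconds: int) -> float:
    """Estimate daily unbound restarts caused by DHCP renewals.

    Assumption: each client renews once per half lease period and every
    renewal restarts unbound. Wifi reconnects are not counted.
    """
    renew_interval = lease_seconds / 2
    return devices * 86400 / renew_interval


# Hypothetical example: 10 devices with a 2-hour lease.
print(restarts_per_day(devices=10, lease_seconds=7200))  # 240.0
```

Even a small network could then see a restart every few minutes, which is why the duration of each stop+start matters.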
And what I don't like at all : unbound receives a stop command. It will not just stop, it will take down all the memory structures, caches etc, and this takes time. Same thing when it starts, it has a lot to do.
What happens if another stop event comes in ? And another one right at this moment ? At best you have a lot of race conditions. And that is .... well, that's a situation I don't like at all. I was a programmer in my previous life (C, C++, C# etc).
I have this for the last month or so :
[22.05-RELEASE][admin@pfSEnse.my-site.net]/root: grep -E -i '(start|stopped)' /var/log/resolver.log
<30>1 2022-06-19T00:18:12.638505+02:00 pfSEnse.my-site.net unbound 61884 - - [61884:0] info: service stopped (unbound 1.13.2).
<30>1 2022-06-19T00:18:15.294608+02:00 pfSEnse.my-site.net unbound 86881 - - [86881:0] info: start of service (unbound 1.13.2).
<30>1 2022-06-19T11:10:42.240792+02:00 pfSEnse.my-site.net unbound 86881 - - [86881:0] info: service stopped (unbound 1.13.2).
<30>1 2022-06-21T02:29:21.972768+02:00 pfSEnse.my-site.net unbound 22445 - - [22445:0] info: start of service (unbound 1.13.2).
<30>1 2022-06-21T02:29:42.732938+02:00 pfSEnse.my-site.net unbound 22445 - - [22445:0] info: service stopped (unbound 1.13.2).
<30>1 2022-06-21T02:30:35.984183+02:00 pfSEnse.my-site.net unbound 43631 - - [43631:0] info: start of service (unbound 1.13.2).
<30>1 2022-06-21T02:31:25.988265+02:00 pfSEnse.my-site.net unbound 43631 - - [43631:0] info: service stopped (unbound 1.13.2).
<30>1 2022-06-21T02:32:01.804532+02:00 pfSEnse.my-site.net unbound 39746 - - [39746:0] info: start of service (unbound 1.13.2).
<30>1 2022-06-21T02:32:18.435160+02:00 pfSEnse.my-site.net unbound 39746 - - [39746:0] info: service stopped (unbound 1.13.2).
<30>1 2022-06-21T02:32:20.879029+02:00 pfSEnse.my-site.net unbound 23048 - - [23048:0] info: start of service (unbound 1.13.2).
<30>1 2022-06-21T02:32:27.600193+02:00 pfSEnse.my-site.net unbound 23048 - - [23048:0] info: service stopped (unbound 1.13.2).
<30>1 2022-06-21T02:32:30.041500+02:00 pfSEnse.my-site.net unbound 18498 - - [18498:0] info: start of service (unbound 1.13.2).
<30>1 2022-06-21T02:32:46.722051+02:00 pfSEnse.my-site.net unbound 18498 - - [18498:0] info: service stopped (unbound 1.13.2).
<30>1 2022-06-21T02:32:49.996440+02:00 pfSEnse.my-site.net unbound 82950 - - [82950:0] info: start of service (unbound 1.13.2).
<30>1 2022-06-21T02:32:52.506637+02:00 pfSEnse.my-site.net unbound 82950 - - [82950:0] info: service stopped (unbound 1.13.2).
<30>1 2022-06-21T02:32:56.071159+02:00 pfSEnse.my-site.net unbound 58776 - - [58776:0] info: start of service (unbound 1.13.2).
<30>1 2022-06-21T02:33:02.527865+02:00 pfSEnse.my-site.net unbound 58776 - - [58776:0] info: service stopped (unbound 1.13.2).
<30>1 2022-06-21T02:33:16.489983+02:00 pfSEnse.my-site.net unbound 42004 - - [42004:0] info: start of service (unbound 1.13.2).
<30>1 2022-06-21T02:33:25.849070+02:00 pfSEnse.my-site.net unbound 42004 - - [42004:0] info: service stopped (unbound 1.13.2).
<30>1 2022-06-21T02:33:29.721989+02:00 pfSEnse.my-site.net unbound 88115 - - [88115:0] info: start of service (unbound 1.13.2).
<30>1 2022-06-21T02:33:34.115088+02:00 pfSEnse.my-site.net unbound 88115 - - [88115:0] info: service stopped (unbound 1.13.2).
<30>1 2022-06-21T02:33:35.895850+02:00 pfSEnse.my-site.net unbound 7800 - - [7800:0] info: start of service (unbound 1.13.2).
<30>1 2022-06-21T02:38:33.501144+02:00 pfSEnse.my-site.net unbound 7800 - - [7800:0] info: service stopped (unbound 1.13.2).
<30>1 2022-06-21T02:38:36.071832+02:00 pfSEnse.my-site.net unbound 390 - - [390:0] info: start of service (unbound 1.13.2).
<30>1 2022-06-21T02:38:41.846193+02:00 pfSEnse.my-site.net unbound 390 - - [390:0] info: service stopped (unbound 1.13.2).
<30>1 2022-06-21T02:38:45.098570+02:00 pfSEnse.my-site.net unbound 9309 - - [9309:0] info: start of service (unbound 1.13.2).
<30>1 2022-06-21T02:38:52.371103+02:00 pfSEnse.my-site.net unbound 9309 - - [9309:0] info: service stopped (unbound 1.13.2).
<30>1 2022-06-22T16:46:33.383773+02:00 pfSEnse.my-site.net unbound 53212 - - [53212:0] info: start of service (unbound 1.13.2).
<30>1 2022-06-22T16:46:52.584831+02:00 pfSEnse.my-site.net unbound 53212 - - [53212:0] info: service stopped (unbound 1.13.2).
<30>1 2022-06-22T16:46:55.806446+02:00 pfSEnse.my-site.net unbound 31030 - - [31030:0] info: start of service (unbound 1.13.2).
<30>1 2022-06-22T16:50:30.755244+02:00 pfSEnse.my-site.net unbound 31030 - - [31030:0] info: service stopped (unbound 1.13.2).
<30>1 2022-06-22T16:50:32.673715+02:00 pfSEnse.my-site.net unbound 86962 - - [86962:0] info: start of service (unbound 1.13.2).
<30>1 2022-06-22T16:52:26.575190+02:00 pfSEnse.my-site.net unbound 86962 - - [86962:0] info: service stopped (unbound 1.13.2).
<30>1 2022-06-22T16:52:28.480685+02:00 pfSEnse.my-site.net unbound 83093 - - [83093:0] info: start of service (unbound 1.13.2).
<30>1 2022-06-22T16:54:13.110429+02:00 pfSEnse.my-site.net unbound 83093 - - [83093:0] info: service stopped (unbound 1.13.2).
<30>1 2022-06-22T16:54:15.008855+02:00 pfSEnse.my-site.net unbound 41447 - - [41447:0] info: start of service (unbound 1.13.2).
<30>1 2022-06-22T16:59:55.007531+02:00 pfSEnse.my-site.net unbound 41447 - - [41447:0] info: service stopped (unbound 1.13.2).
<30>1 2022-06-22T16:59:56.932470+02:00 pfSEnse.my-site.net unbound 2988 - - [2988:0] info: start of service (unbound 1.13.2).
<30>1 2022-06-23T17:15:38.524324+02:00 pfSEnse.my-site.net unbound 2988 - - [2988:0] info: service stopped (unbound 1.13.2).
<30>1 2022-06-23T17:15:40.408052+02:00 pfSEnse.my-site.net unbound 49970 - - [49970:0] info: start of service (unbound 1.13.2).
<30>1 2022-06-27T00:00:42.045699+02:00 pfSEnse.my-site.net unbound 49970 - - [49970:0] info: service stopped (unbound 1.13.2).
<30>1 2022-07-02T00:00:48.060874+02:00 pfSEnse.my-site.net unbound 97180 - - [97180:0] info: start of service (unbound 1.15.0).
<30>1 2022-07-05T00:00:33.888195+02:00 pfSEnse.my-site.net unbound 97180 - - [97180:0] info: service stopped (unbound 1.15.0).
<30>1 2022-07-05T00:00:35.778522+02:00 pfSEnse.my-site.net unbound 3931 - - [3931:0] info: start of service (unbound 1.15.0).
<30>1 2022-07-08T10:22:30.888264+02:00 pfSEnse.my-site.net unbound 3931 - - [3931:0] info: service stopped (unbound 1.15.0).
<30>1 2022-07-08T10:22:34.399240+02:00 pfSEnse.my-site.net unbound 685 - - [685:0] info: start of service (unbound 1.15.0).
<30>1 2022-07-14T00:00:50.029350+02:00 pfSEnse.my-site.net unbound 83848 - - [83848:0] info: start of service (unbound 1.15.0).
<30>1 2022-07-17T00:00:54.309024+02:00 pfSEnse.my-site.net unbound 83848 - - [83848:0] info: service stopped (unbound 1.15.0).
<30>1 2022-07-17T00:00:56.408997+02:00 pfSEnse.my-site.net unbound 54222 - - [54222:0] info: start of service (unbound 1.15.0).
<30>1 2022-07-18T00:00:58.241048+02:00 pfSEnse.my-site.net unbound 54222 - - [54222:0] info: service stopped (unbound 1.15.0).
<30>1 2022-07-18T00:01:00.444239+02:00 pfSEnse.my-site.net unbound 22032 - - [22032:0] info: start of service (unbound 1.15.0).
<30>1 2022-07-18T11:32:19.600367+02:00 pfSEnse.my-site.net unbound 22032 - - [22032:0] info: service stopped (unbound 1.15.0).
<30>1 2022-07-18T11:32:26.992179+02:00 pfSEnse.my-site.net unbound 8790 - - [8790:0] info: start of service (unbound 1.15.0).
<30>1 2022-07-19T13:00:20.983512+02:00 pfSEnse.my-site.net unbound 8790 - - [8790:0] info: service stopped (unbound 1.15.0).
<30>1 2022-07-19T13:00:22.083823+02:00 pfSEnse.my-site.net unbound 18042 - - [18042:0] info: start of service (unbound 1.15.0).
<30>1 2022-07-20T12:00:34.296337+02:00 pfSEnse.my-site.net unbound 18042 - - [18042:0] info: service stopped (unbound 1.15.0).
<30>1 2022-07-20T12:00:36.365909+02:00 pfSEnse.my-site.net unbound 24358 - - [24358:0] info: start of service (unbound 1.15.0).
<30>1 2022-07-21T00:00:38.712404+02:00 pfSEnse.my-site.net unbound 24358 - - [24358:0] info: service stopped (unbound 1.15.0).
<30>1 2022-07-21T00:00:40.611149+02:00 pfSEnse.my-site.net unbound 58053 - - [58053:0] info: start of service (unbound 1.15.0).
<30>1 2022-07-21T10:45:44.144614+02:00 pfSEnse.my-site.net unbound 58053 - - [58053:0] info: service stopped (unbound 1.15.0).
<30>1 2022-07-21T10:45:46.047555+02:00 pfSEnse.my-site.net unbound 93413 - - [93413:0] info: start of service (unbound 1.15.0).
<30>1 2022-07-21T14:29:28.563979+02:00 pfSEnse.my-site.net unbound 93413 - - [93413:0] info: service stopped (unbound 1.15.0).
<30>1 2022-07-21T14:29:32.039512+02:00 pfSEnse.my-site.net unbound 70486 - - [70486:0] info: start of service (unbound 1.15.0).
<30>1 2022-07-21T14:29:44.413710+02:00 pfSEnse.my-site.net unbound 70486 - - [70486:0] info: service stopped (unbound 1.15.0).
<30>1 2022-07-21T14:29:47.879285+02:00 pfSEnse.my-site.net unbound 50669 - - [50669:0] info: start of service (unbound 1.15.0).
<30>1 2022-07-21T14:30:17.753945+02:00 pfSEnse.my-site.net unbound 50669 - - [50669:0] info: service stopped (unbound 1.15.0).
<30>1 2022-07-21T14:30:21.189694+02:00 pfSEnse.my-site.net unbound 24816 - - [24816:0] info: start of service (unbound 1.15.0).
<30>1 2022-07-25T00:00:41.414948+02:00 pfSEnse.my-site.net unbound 24816 - - [24816:0] info: service stopped (unbound 1.15.0).
<30>1 2022-07-25T00:00:43.444923+02:00 pfSEnse.my-site.net unbound 53748 - - [53748:0] info: start of service (unbound 1.15.0).
<30>1 2022-07-25T08:52:18.808277+02:00 pfSEnse.my-site.net unbound 53748 - - [53748:0] info: service stopped (unbound 1.15.0).
<30>1 2022-07-25T08:52:19.890207+02:00 pfSEnse.my-site.net unbound 49444 - - [49444:0] info: start of service (unbound 1.15.0).
<30>1 2022-07-25T08:53:59.193396+02:00 pfSEnse.my-site.net unbound 49444 - - [49444:0] info: service stopped (unbound 1.15.0).
<30>1 2022-07-25T08:53:59.618754+02:00 pfSEnse.my-site.net unbound 78030 - - [78030:0] info: start of service (unbound 1.15.0).
<30>1 2022-07-25T08:54:25.182703+02:00 pfSEnse.my-site.net unbound 78030 - - [78030:0] info: service stopped (unbound 1.15.0).
<30>1 2022-07-25T08:54:27.240339+02:00 pfSEnse.my-site.net unbound 77318 - - [77318:0] info: start of service (unbound 1.15.0).
<30>1 2022-07-26T09:27:12.426763+02:00 pfSEnse.my-site.net unbound 77318 - - [77318:0] info: service stopped (unbound 1.15.0).
<30>1 2022-07-26T09:27:16.229963+02:00 pfSEnse.my-site.net unbound 17569 - - [17569:0] info: start of service (unbound 1.15.0).
<30>1 2022-07-26T09:32:45.480474+02:00 pfSEnse.my-site.net unbound 17569 - - [17569:0] info: service stopped (unbound 1.15.0).
<30>1 2022-07-26T09:32:48.194044+02:00 pfSEnse.my-site.net unbound 74846 - - [74846:0] info: start of service (unbound 1.15.0).
Every "stopped" line should be followed by a "start" line like this :
<30>1 2022-07-26T09:32:48.194044+02:00 pfSense.brit-hotel-fumel.net unbound 74846 - - [74846:0] info: start of service (unbound 1.15.0).
after all, unbound is never just stopped. That only happens when you shut the system down, or you disable unbound in the GUI yourself.
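Checking by eye that every "stopped" has its "start" gets tedious with a long log, so here is a minimal Python sketch that pairs them up and reports how long unbound was down each time. It only assumes the log line layout shown above; `EVENT_RE` and `downtime_gaps` are illustrative names, and a long gap can also simply mean the log was rotated in between.

```python
import re
from datetime import datetime

# Matches the two event types in the resolver.log format shown above.
EVENT_RE = re.compile(
    r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{6}[+-]\d{2}:\d{2}) \S+ unbound (\d+) "
    r"- - \[\d+:\d+\] info: (service stopped|start of service)"
)


def downtime_gaps(log_text: str):
    """Return (stop_time, seconds_until_next_start) for each stop event.

    A stop with no following start yields None as the gap.
    """
    gaps = []
    pending_stop = None
    for stamp, _pid, event in EVENT_RE.findall(log_text):
        when = datetime.fromisoformat(stamp)
        if event == "service stopped":
            if pending_stop is not None:
                gaps.append((pending_stop, None))  # two stops in a row
            pending_stop = when
        elif pending_stop is not None:
            gaps.append((pending_stop, (when - pending_stop).total_seconds()))
            pending_stop = None
    if pending_stop is not None:
        gaps.append((pending_stop, None))  # still stopped at end of log
    return gaps


# First four entries of the log above.
sample = (
    "<30>1 2022-06-19T00:18:12.638505+02:00 pfSEnse.my-site.net unbound 61884 - - [61884:0] info: service stopped (unbound 1.13.2).\n"
    "<30>1 2022-06-19T00:18:15.294608+02:00 pfSEnse.my-site.net unbound 86881 - - [86881:0] info: start of service (unbound 1.13.2).\n"
    "<30>1 2022-06-19T11:10:42.240792+02:00 pfSEnse.my-site.net unbound 86881 - - [86881:0] info: service stopped (unbound 1.13.2).\n"
    "<30>1 2022-06-21T02:29:21.972768+02:00 pfSEnse.my-site.net unbound 22445 - - [22445:0] info: start of service (unbound 1.13.2).\n"
)

gaps = downtime_gaps(sample)
for stop, gap in gaps:
    print(stop.isoformat(), "no start found" if gap is None else f"{gap:.1f} s")
```

Because the pattern is applied with `findall`, it works the same whether the log is pasted one entry per line or as one long run of text.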
All other process actions are stop + start actions.
What I propose : if stopping + starting raises the chance of finding unbound dead in the water, then I vote for lowering the number of these stops and starts.
Disabling "DHCP Leases option" is just one quick way to do this.
There are some redmine bug reports about this.
Better solutions have been proposed, like using unbound-control to insert new host names into the unbound DNS cache when they get announced by the DHCP server (if they gave a valid, non-empty host name, which is often the case, so no restart of the DNS is needed).
Without any proof, I think that arm based devices are more sensitive to these issues.
@ik13 : arm or intel ?
-
@gertjan said in pfSense resolver stops working:
Without any proof, I think that arm based devices are more sensitive to these issues.
I think you might be on to something there, and I also think using tls forwarding doesn't help either..
-
@johnpoz said in pfSense resolver stops working:
and I also think using tls forwarding doesn't help either
Like the TLS hardware support that 'blocks' as seen several times, and only a power down - 10 seconds - power up can make it available to the system again.
Not being able to make a TLS connection, and thus not being able to contact the update servers of Netgate is a known visible part of the issue.
DNS failing to make TLS connections, or even blocking on it, would be a nasty thing.
I totally don't know what I'm talking about, of course.
DNSSEC also 'signs' stuff, and checks signatures, so it uses TLS ? In that case ....
-
@gertjan said in pfSense resolver stops working:
so it uses TLS ? In that case ....
dnssec is not tls based.. The traffic between the dns and the client is not encrypted, the records and info are just signed, and can be verified with the public key.
-
@johnpoz said in pfSense resolver stops working:
The traffic between
Correct. DNSSEC traffic is sent over the wire in the clear.
But it's the crypto checking, the check of hashes, signing keys etc that makes me think : is the same openssl library used ? Guess so : https://www.cloudflare.com/dns/dnssec/dnssec-complexities-and-considerations/
And if so, is openssl using hardware for this, if available ? The same hardware it uses for "AES", TLS etc.
-
@johnpoz said in pfSense resolver stops working:
@maverickws if you're saying it's running, but not responding..
So when you query directly, you get a timeout, NX, refused? Which interfaces do you have unbound using - did you have an interface go down, like wan or vpn, or whatever?
It resolves nothing, not even pfsense's own name? Or local - or does it just not resolve public stuff like google.com?
For example, I just sent an empty query to unbound from my pc, and I get back the roots.. You're saying this fails? With what error, timeout?
Alright so back to the issue again. It happened again yesterday at local time 17:46 GMT +1 (daylight savings) - Not resolving.
In the meanwhile it recovered and I waited until it failed again.
Today it didn't recover; I had to restart the unbound service manually, and before I did, I ran all of the remaining tests.
- DNS Resolver system logs: yesterday there were no entries after 16:50 (the issue occurred at ~17:45), and today no entries after 10:35 (the issue occurred after 12:00). No start/stops;
- The last entry from the dhcpleases process is also from 16:50 yesterday, and from 10:36 today;
- No interface changes or other issues within the timeframe where the issue started occurring, let's say the last 20 minutes;
Last entries on resolver log:
Time Process PID Message
Jul 26 16:50:08 unbound 94538 [94538:0] info: generate keytag query _ta-4f66. NULL IN
Jul 26 16:50:07 unbound 94538 [94538:0] info: start of service (unbound 1.15.0).
Jul 26 16:50:07 unbound 94538 [94538:0] notice: init module 1: iterator
Jul 26 16:50:07 unbound 94538 [94538:0] notice: init module 0: validator
Jul 26 16:50:07 unbound 94538 [94538:0] notice: Restart of unbound 1.15.0.
On the general log:
Time Process PID Message
Jul 26 17:36:00 sshguard 75200 Now monitoring attacks.
Jul 26 17:36:00 sshguard 67697 Exiting on signal.
Jul 26 17:10:00 sshguard 67697 Now monitoring attacks.
Jul 26 17:10:00 sshguard 26927 Exiting on signal.
When I do dig to the interface CARP VIP without any query:
# dig @10.0.0.254
; <<>> DiG 9.11.36-RedHat-9.11.36-3.el8 <<>> @10.0.0.254
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 45328
;; flags: qr rd ra ad; QUERY: 1, ANSWER: 13, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1332
;; QUESTION SECTION:
;. IN NS
;; ANSWER SECTION:
. 83046 IN NS j.root-servers.net.
. 83046 IN NS k.root-servers.net.
. 83046 IN NS l.root-servers.net.
. 83046 IN NS m.root-servers.net.
. 83046 IN NS a.root-servers.net.
. 83046 IN NS b.root-servers.net.
. 83046 IN NS c.root-servers.net.
. 83046 IN NS d.root-servers.net.
. 83046 IN NS e.root-servers.net.
. 83046 IN NS f.root-servers.net.
. 83046 IN NS g.root-servers.net.
. 83046 IN NS h.root-servers.net.
. 83046 IN NS i.root-servers.net.
;; Query time: 1 msec
;; SERVER: 10.0.0.254#53(10.0.0.254)
;; WHEN: Tue Jul 26 17:46:02 WEST 2022
;; MSG SIZE rcvd: 239
When I do the dig with a query:
# dig @10.0.0.254 google.com
; <<>> DiG 9.11.36-RedHat-9.11.36-3.el8 <<>> @10.0.0.254 google.com
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 10156
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1332
;; QUESTION SECTION:
;google.com. IN A
;; Query time: 0 msec
;; SERVER: 10.0.0.254#53(10.0.0.254)
;; WHEN: Tue Jul 26 17:47:28 WEST 2022
;; MSG SIZE rcvd: 39
So the error is SERVFAIL.
If the query has an FQDN of the internal domain, it returns the results successfully without any error (NOERROR). So it does resolve locally.
Result of nslookup:
# nslookup stackoverflow.com
;; Got SERVFAIL reply from PUB.LIC.DNS.SV0, trying next server
;; Got SERVFAIL reply from PUB.LIC.DNS.SV1, trying next server
Server: 10.0.0.254
Address: 10.0.0.254#53
** server can't find stackoverflow.com: SERVFAIL
After restarting unbound it starts working again.
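For anyone wanting to capture this automatically the next time it happens, here is a minimal Python sketch that reads the response code and answer count out of saved dig output, so a root query, a local name and a public name can be compared side by side. The function names and the `NO_RESPONSE` label are made up for illustration; the sample headers are the ones from the two dig runs above.

```python
import re


def dig_status(dig_output: str) -> str:
    """Return the rcode (e.g. NOERROR, SERVFAIL) from dig's header line."""
    match = re.search(r"status:\s*([A-Z]+)", dig_output)
    if match is None:
        # No header at all, e.g. dig timed out without reaching a server.
        return "NO_RESPONSE"
    return match.group(1)


def answer_count(dig_output: str) -> int:
    """Return the ANSWER count from dig's flags line, or 0 if absent."""
    match = re.search(r"ANSWER:\s*(\d+)", dig_output)
    return int(match.group(1)) if match else 0


root_query = (
    ";; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 45328\n"
    ";; flags: qr rd ra ad; QUERY: 1, ANSWER: 13, AUTHORITY: 0, ADDITIONAL: 1"
)
google_query = (
    ";; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 10156\n"
    ";; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1"
)

for name, text in [("root NS", root_query), ("google.com", google_query)]:
    print(name, dig_status(text), answer_count(text))
```

A timeout would point at unbound not answering at all, whereas SERVFAIL with unbound still answering local names points at the upstream lookup failing.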
I have no stopped/started messages anywhere near the time it stops working.
EDIT:
In the meanwhile I disabled the Register DHCP Leases option on the Resolver to see how it goes.
But while looking for possible causes behind this, I found this blog article, which got me thinking that, in fact, the issue occurs with domain names that have many CNAME records, such as google, stripe, stackoverflow, etc. But I really don't know. Just trying to look everywhere to find the culprit.
This is causing us major concerns: for example, customers try to log in to websites on servers behind this pfSense, and if the server can't resolve google for the recaptcha, people can't log in; if it can't resolve stripe, we get payment issues. So this is causing quite a bit of grief.
EDIT2:
I just remembered one of the issues that occur is with our email server, and that record is an A record, not a CNAME, so what I mentioned before must be unrelated.
-
@maverickws said in pfSense resolver stops working:
So it does resolve locally
This is good info, so unbound is running and can resolve locally - so the trick here is figuring out why you get servfail on specific domains or fqdns
Not sure exactly what this is
;; Got SERVFAIL reply from PUB.LIC.DNS.SV0, trying next server
;; Got SERVFAIL reply from PUB.LIC.DNS.SV1, trying next server
To me that reads that you're forwarding, and where you're forwarding to sent back a fail.
Did you obfuscate the server IP or something with that? Who exactly got asked that reported servfail?
Or is that just the client saying hey I asked these 2 servers and they both reported servfail - and what are those 2 servers?