Netgate Discussion Forum

pfSense resolver stops working

DHCP and DNS
  • J
    johnpoz LAYER 8 Global Moderator @maverickws
    last edited by Jul 27, 2022, 12:54 PM

    @maverickws ok that makes more sense ;)

    So when it fails like that with servfail - all things you try fail, or does anything work?

    The problem with servfail is that it's sort of a catchall and isn't specific about what exactly failed.. But knowing that local resources are resolving tells us unbound didn't go fully belly up.

    Might help to up the verbosity of the unbound logs all the way, but that can be a lot of logging ;)

    I have a 3100 sitting here in a box.. I am thinking of firing it up and then running some dnsperf testing on it, say have it run through a million different queries at like 100 queries a second or something, and then loop that to see if I can cause a failure.. There are sample files you can download that have 10 million records in them to look up..
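
A minimal sketch of preparing input for that kind of load test. dnsperf reads a plain-text data file with one query per line, a name followed by a record type. The helper name `build_query_lines` and the example domains are hypothetical, used only for illustration; a real test would use one of the large sample files mentioned above.

```python
import itertools

def build_query_lines(names, qtype="A", total=1000):
    """Return `total` lines of '<name> <qtype>', cycling through `names`."""
    if not names:
        raise ValueError("need at least one name to cycle through")
    # cycle() repeats the list forever; islice() cuts it off at `total` entries.
    looped = itertools.islice(itertools.cycle(names), total)
    return [f"{name} {qtype}" for name in looped]

# Hypothetical names, only to show the shape of the file.
lines = build_query_lines(["example.com", "example.org"], total=5)
print("\n".join(lines))
```

Writing the returned lines to a file gives something a load tool can loop over; the query rate and number of passes are then set on the tool's side.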

    hmmmm - need to check my cal to what real work is going to be like today ;)

    An intelligent man is sometimes forced to be drunk to spend time with his fools
    If you get confused: Listen to the Music Play
    Please don't Chat/PM me for help, unless mod related
    SG-4860 24.11 | Lab VMs 2.7.2, 24.11

    • B
      bmeeks
      last edited by bmeeks Jul 27, 2022, 1:21 PM Jul 27, 2022, 1:16 PM

      Here is my suspicion about the unbound problems.

      pfSense is currently running the 1.15.0 version of unbound in the RELEASE branches. That version has a bug that is discussed at length here: https://github.com/NLnetLabs/unbound/issues/670. That bug should be fixed in the latest unbound package version (which is 1.16.1).

      FreeBSD ports has the most recent unbound version (1.16.1). Because unbound is a built-in package within pfSense, I don't think it is easy for them to push an update unless they change the pfSense version.

      And just to be clear, turning on the "Register DHCP Leases" option is also problematic because it results in a ton of unbound restarts. While updating to the latest unbound version, I would also like to see the Netgate team fix the "Register DHCP Leases" option so that it works properly and does not restart the resolver with each lease renewal.

        • M
          maverickws @johnpoz
          last edited by maverickws Jul 27, 2022, 1:20 PM Jul 27, 2022, 1:18 PM

          @johnpoz sorry for not being clearer!

          Ok so I'm not sure what your question is when you say

          So when it fails like that with servfail - all things you try fail, or does anything work?

          What things do you mean? Usually it goes like this: we start catching errors, like captcha stops working or the API connection to Stripe stops, and our mail server also sends warnings on failed resolutions, so that's how we learn about the occurrence.
          We then log in to our jump box and to the server where the errors come from, which could be a web server, the mail server or another one; e.g. yesterday I was testing on the webserver and jump box and today I was testing on the mail server. These are VMs, and the hosts where these VMs sit are maybe a thousand klicks apart: the host with the webserver VM is at a DC in Germany, the jumpbox is at one DC in Scandinavia, and the mail server is also in Scandinavia but in another DC room.

          Since I've disabled the DHCP leases option, haven't had any more hiccups.

          EDIT:
          @bmeeks 's comment and issue do seem very to the point.

          • G
            Gertjan @maverickws
            last edited by Jul 27, 2022, 1:56 PM

            @maverickws said in pfSense resolver stops working:

            the pfSense's CARP WAN VIP and CARP DMZ VIP.
            ....
            dig @10.0.0.254 google.com
            ....
            Server: 10.0.0.254
            Address: 10.0.0.254#53

            This 10.0.0.254 is a virtual or 'software' defined interface ?
            ( I never used VIP or CARP stuff )

            While failing, what happens when you do the mighty :

            dig @127.0.0.1 google.com
            

            I recall (a couple of years ago) seeing on my own pfSense that "127.0.0.1" didn't exist any more.
            That was bad.
            It wasn't unbound's fault, and unbound didn't like this situation at all.
            I, as an admin, could still 'dig' using any of my LAN interface IPs.
            I didn't know what killed 127.0.0.1, and had to reboot.

            @bmeeks That bug report was already mentioned not so long ago.
            My thoughts : It is an OpenBSD 7 compiled version.
            Does the fact that "OpenBSD" is mentioned there mean that it is OpenBSD related ?
            One of the unbound coders is posting : wouldn't he know whether it could be an "any OS" issue ?

            The patch goes into iterator/iterator.c : that is, for me, the core of the resolver.

            Btw : the patch :

            The green 'added' code :

            	iter_mark_cycle_targets(qstate, iq->dp);
            	missing = (int)delegpt_count_missing_targets(iq->dp);
            	log_assert(maxtargets != 0); /* that would not be useful */
            
            	/* Generate target requests. Basically, any missing targets
            	 * are queried for here, regardless if it is necessary to do
            	 * so to continue processing. */
            	if(maxtargets < 0 || maxtargets > missing)
            		toget = missing;
            	else	toget = maxtargets;
            	if(toget == 0) {
            		*num = 0;
            		return 1;
            	}
            

            The removed "red" code

            	iter_mark_cycle_targets(qstate, iq->dp);
            	missing = (int)delegpt_count_missing_targets(iq->dp);
            	log_assert(maxtargets != 0); /* that would not be useful */
            
            	/* Generate target requests. Basically, any missing targets 
            	 * are queried for here, regardless if it is necessary to do 
            	 * so to continue processing. */
            	if(maxtargets < 0 || maxtargets > missing)
            		toget = missing;
            	else	toget = maxtargets;
            	if(toget == 0) {
            		*num = 0;
            		return 1;
            	}
            

            The WTF part : both are identical to me.
            That's what I call a NOP.
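
One way to check whether two hunks that look identical really are is to compare them line by line and flag lines that match once trailing whitespace is stripped. In the excerpt as pasted above, the first two comment lines of the removed code end with a trailing space while the added ones do not, so this particular hunk appears to be a whitespace-only change. The sketch below is illustrative, and `whitespace_only_changes` is a hypothetical helper name.

```python
def whitespace_only_changes(old_lines, new_lines):
    """Return indexes of lines that differ only in trailing whitespace."""
    return [
        i
        for i, (old, new) in enumerate(zip(old_lines, new_lines))
        if old != new and old.rstrip() == new.rstrip()
    ]

# Comment lines as they appear in the pasted hunks (note the trailing spaces).
removed = [
    "\t/* Generate target requests. Basically, any missing targets ",
    "\t * are queried for here, regardless if it is necessary to do ",
    "\t * so to continue processing. */",
]
added = [
    "\t/* Generate target requests. Basically, any missing targets",
    "\t * are queried for here, regardless if it is necessary to do",
    "\t * so to continue processing. */",
]

for i in whitespace_only_changes(removed, added):
    # repr() makes the otherwise invisible trailing space visible.
    print(i, repr(removed[i][-10:]), "->", repr(added[i][-10:]))
```

This only explains why the two pasted blocks look the same; it says nothing about where the functional part of the patch lives.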

            No "help me" PM's please. Use the forum, the community will thank you.
            Edit : and where are the logs ??

            • J
              johnpoz LAYER 8 Global Moderator @maverickws
              last edited by johnpoz Jul 27, 2022, 2:06 PM Jul 27, 2022, 2:06 PM

              @maverickws Yeah I concur with @bmeeks: unbound should be updated if there are known issues in 1.15 that could cause failure, even if they are not directly related to this one.

              From that thread, makes mention of

              do-ip6: no

              And that user was then unable to reproduce the problem... That could be something you could try.. It's easy enough to add to the custom options box.

              What I meant with my question: while you do mention that a few domains fail, is nothing resolving at all? Do cached entries still work? When you were testing and seeing servfail, did anything respond, or was everything nonlocal you tried servfail? You can always look in the cache. If there is an issue with resolving but the cache still works, that is just another piece of the puzzle that could be helpful.

              • B
                bmeeks @Gertjan
                last edited by Jul 27, 2022, 2:21 PM

                @gertjan:
                The new code is added to the source file up higher. That code is a type of "limit check". It is called earlier in the revised code than it was in the v1.15.0 code.

                It now makes its test earlier in the processing logic. That is the "fix" for the bug.

                • M
                  maverickws @johnpoz
                  last edited by maverickws Jul 27, 2022, 2:37 PM Jul 27, 2022, 2:26 PM

                  @johnpoz we do have ipv6 enabled we can look at that but ... not ideal.

                  Please tell me how do we proceed from here to get in touch with Netgate to urge for a patch on the OS to update unbound? It would be nice to get someone's attention to the matter.

                  Right now, even with the register dhcp leases option disabled, unbound failed again. I'll take a look into the do-ip6: no option now and see if it helps.

                  EDIT:
                  @gertjan sorry I missed your reply!!

                  Ok I'll run that test once it fails again. I'll hold off on the do-ip6: no option for a while and will get back with the dig to localhost on the pfSense.

                  This 10.0.0.254 is a virtual or 'software' defined interface ?
                  ( I never used VIP or CARP stuff )

                  The 10.0.0.254 is a Virtual IP type "CARP" on interface LAN network 10.0.0.0/24. the primary pfSense is 10.0.0.1 and the secondary is 10.0.0.2. To ensure traffic continuity, this VIP is used as Gateway and DNS Resolver on the DHCP options for machines that connect to the LAN interface.

                    • J
                      johnpoz LAYER 8 Global Moderator @maverickws
                      last edited by Jul 27, 2022, 2:39 PM

                      @maverickws said in pfSense resolver stops working:

                      we do have ipv6 enabled we can look at that but ... not ideal.

                      keep in mind, that doesn't turn off ipv6 - it just tells unbound to not resolve using IPv6..
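
To confirm the option actually landed in the resolver's configuration, a small sketch like the following can scan config text for it. It assumes unbound.conf-style `key: value` lines with `#` comments; `option_values` is a hypothetical helper, and the sample text is made up for illustration. Per the unbound documentation, `do-ip6` defaults to yes when it is not set.

```python
def option_values(config_text, option):
    """Return every value given for `option` in unbound.conf-style text."""
    values = []
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        if key.strip() == option:
            values.append(value.strip())
    return values

# Illustrative config text, not taken from a real system.
sample = """
server:
    verbosity: 1
    do-ip6: no   # resolve over IPv4 only
"""

print(option_values(sample, "do-ip6"))
```

An empty list means the option is not set in the text that was scanned, so the default applies.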

                      • M
                        maverickws @johnpoz
                        last edited by Jul 27, 2022, 2:43 PM

                        @johnpoz I know I know! It's just that it is a feature (and crucial on IPv6-only networks, which isn't the case here) and we're disabling it. So the ideal would be not having to do so.

                        I'm just waiting for it to blow up again to do the dig against localhost as asked by @Gertjan

                        • I
                          ik13 @Gertjan
                          last edited by Jul 27, 2022, 4:34 PM

                          @gertjan said in pfSense resolver stops working:

                          Without any proof, I think that arm based devices are more sensitive to these issues.
                          @ik13 : arm or intel ?
                          Intel

                          • M
                            maverickws
                            last edited by Jul 27, 2022, 7:25 PM

                            @Gertjan

                            [22.05-RELEASE][root@pf.net]/root: dig @127.0.0.1 stackoverflow.com A
                            
                            ; <<>> DiG 9.16.26 <<>> @127.0.0.1 stackoverflow.com A
                            ; (1 server found)
                            ;; global options: +cmd
                            ;; Got answer:
                            ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 63882
                            ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
                            
                            ;; OPT PSEUDOSECTION:
                            ; EDNS: version: 0, flags:; udp: 1332
                            ;; QUESTION SECTION:
                            ;stackoverflow.com.		IN	A
                            
                            ;; Query time: 0 msec
                            ;; SERVER: 127.0.0.1#53(127.0.0.1)
                            ;; WHEN: Wed Jul 27 20:22:50 WEST 2022
                            ;; MSG SIZE  rcvd: 46
                            

                            So logged in as root on the pfSense and doing the dig against localhost also returns SERVFAIL.
                            BTW the CPU here is also x86_64.

                            • J
                              johnpoz LAYER 8 Global Moderator @maverickws
                              last edited by Jul 27, 2022, 7:27 PM

                              @maverickws but local resources resolve. Does anything remote work? Check what's in your cache and try to query something that is currently cached.

                              • M
                                maverickws @johnpoz
                                last edited by Jul 27, 2022, 7:33 PM

                                @johnpoz ah man I just restarted the service ;_;

                                will check that out tomorrow. in the meanwhile, and seriously, what would be a reasonable expectation for the unbound version to be bumped on a patch release?

                                • M
                                  maverickws @johnpoz
                                  last edited by maverickws Jul 28, 2022, 1:49 PM Jul 28, 2022, 9:25 AM

                                  @johnpoz good morning guys,

                                  So this morning we saw the connection to Stripe API failing. Tests from the pfSense:

                                  [22.05-RELEASE][root@pf.net]/root: dig stripe.com A
                                  
                                  ; <<>> DiG 9.16.26 <<>> stripe.com A
                                  ;; global options: +cmd
                                  ;; Got answer:
                                  ;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 48056
                                  ;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1
                                  
                                  ;; OPT PSEUDOSECTION:
                                  ; EDNS: version: 0, flags:; udp: 1332
                                  ;; QUESTION SECTION:
                                  ;stripe.com.			IN	A
                                  
                                  ;; Query time: 0 msec
                                  ;; SERVER: 127.0.0.1#53(127.0.0.1)
                                  ;; WHEN: Thu Jul 28 10:19:09 WEST 2022
                                  ;; MSG SIZE  rcvd: 39
                                  
                                  [22.05-RELEASE][root@pf.net]/root: nslookup stripe.com
                                  Server:		127.0.0.1
                                  Address:	127.0.0.1#53
                                  
                                  ** server can't find stripe.com: SERVFAIL
                                  

                                  I checked that stackoverflow.com was still in the DNS cache. So if I query that:

                                  [22.05-RELEASE][root@pf.net]/root: dig stackoverflow.com A
                                  
                                  ; <<>> DiG 9.16.26 <<>> stackoverflow.com A
                                  ;; global options: +cmd
                                  ;; Got answer:
                                  ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 9120
                                  ;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 1
                                  
                                  ;; OPT PSEUDOSECTION:
                                  ; EDNS: version: 0, flags:; udp: 1332
                                  ;; QUESTION SECTION:
                                  ;stackoverflow.com.		IN	A
                                  
                                  ;; ANSWER SECTION:
                                  stackoverflow.com.	191	IN	A	151.101.129.69
                                  stackoverflow.com.	191	IN	A	151.101.193.69
                                  stackoverflow.com.	191	IN	A	151.101.65.69
                                  stackoverflow.com.	191	IN	A	151.101.1.69
                                  
                                  ;; Query time: 0 msec
                                  ;; SERVER: 127.0.0.1#53(127.0.0.1)
                                  ;; WHEN: Thu Jul 28 10:23:13 WEST 2022
                                  ;; MSG SIZE  rcvd: 110
                                  
                                  [22.05-RELEASE][root@pf.net]/root: nslookup stackoverflow.com
                                  Server:		127.0.0.1
                                  Address:	127.0.0.1#53
                                  
                                  Non-authoritative answer:
                                  Name:	stackoverflow.com
                                  Address: 151.101.129.69
                                  Name:	stackoverflow.com
                                  Address: 151.101.193.69
                                  Name:	stackoverflow.com
                                  Address: 151.101.65.69
                                  Name:	stackoverflow.com
                                  Address: 151.101.1.69
                                  
                                  

                                  I get the answers and no error.
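
The comparison above (a cached name still answers, an uncached one gets SERVFAIL) can be sketched as a small check on dig's text output. The helper names `dig_summary` and `looks_cache_only` are hypothetical; the header lines are copied from the two dig runs in this post.

```python
import re

def dig_summary(output):
    """Pull the response status and answer count out of dig's text output."""
    status = re.search(r"status: ([A-Z]+)", output)
    answers = re.search(r"ANSWER: (\d+)", output)
    return (
        status.group(1) if status else None,
        int(answers.group(1)) if answers else 0,
    )

def looks_cache_only(cached_output, uncached_output):
    """True when a cached name still answers but an uncached one gets SERVFAIL."""
    cached_status, cached_answers = dig_summary(cached_output)
    uncached_status, _ = dig_summary(uncached_output)
    return (
        cached_status == "NOERROR"
        and cached_answers > 0
        and uncached_status == "SERVFAIL"
    )

STRIPE = """;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 48056
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1"""
STACKOVERFLOW = """;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 9120
;; flags: qr rd ra; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 1"""

print(looks_cache_only(STACKOVERFLOW, STRIPE))  # prints True
```

A True result matches what the outputs show: unbound is still serving from its cache while new recursive lookups fail.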

                                  EDIT:
                                  After these last tests I've added the do-ip6: no option to the resolver. So far, we haven't had any more hiccups. Hoping it mitigates the issue until a proper fix is out at least.

                                  • L
                                    lohphat @maverickws
                                    last edited by Jul 29, 2022, 2:24 AM

                                    @maverickws said in pfSense resolver stops working:

                                    EDIT:
                                    After these last tests I've added the do-ip6: no option to the resolver. So far, we haven't had any more hiccups. Hoping it mitigates the issue until a proper fix is out at least.

                                    I wonder what the current tally is of those with IPv6 enabled and unbound having issues? I conjectured that unbound has been silently suffering memory/heap issues since the updates it has received since 22.01, and that disabling IPv6 makes the problem go away because the memory overhead is reduced.

                                    Methinks we might have a smoking gun to warrant looking at unbound's memory footprint.

                                    SG-3100 24.11-RELEASE (arm) | Avahi (2.2_6) | ntopng (5.6.0_1) | openvpn-client-export (1.9.5) | pfBlockerNG-devel (3.2.1_20) | System_Patches (2.2.20_1)

                                    • M
                                      maverickws
                                      last edited by Jul 29, 2022, 9:18 AM

                                      Well truth is I haven't had this issue since I've added the do-ip6: no option.
                                      Everything's running smoothly, no more failed queries. I'm even considering re-enabling dhcp leases just to see if it really has any impact on this or was all due to that option.

                                      In either case, I also have a pfSense with 22.05 at home and I don't have this issue.
                                      The difference in the setups would be, the datacenter segment where this was happening only has IPv6 enabled locally, while at my place I do have an IPv6 WAN connection.

                                      I don't think it's memory related (could be wrong ofc), as I've never seen the pfSense anywhere near its limits for either memory or CPU.

                                      • G
                                        Gertjan @maverickws
                                        last edited by Jul 29, 2022, 9:45 AM

                                        @maverickws said in pfSense resolver stops working:

                                        And we're all OK with that?

                                        Nope, so I unchecked the "Register DHCP leases" setting and was done with it.

                                        See the existing pfSense Redmine bug reports about this subject.

                                        Some possible solutions have been mentioned already.
                                        We all wait for that person that is willing to write the code, some others to test it.
                                        The usual development sequence.

                                        Even on 'big' networks with a lot of PC type devices that are always connected, this is (nearly) not noticeable.
                                        But then came the connect-disconnect-connect-disconnect type of device : our smartphones that go out of wifi range, come back into wifi range, etc. That triggers a new DHCP sequence with the now known side effects. Now you have issues.

                                        And things became worse : the market was flooded with cheap no-brain devices that renew their lease every 7200 seconds, no matter what.
                                        So, it's true : that cheap connected doorbell gadget can really destroy your DNS stability.

                                        @maverickws said in pfSense resolver stops working:

                                        I also have a pfSense at home which is one version behind (22.01), with pfBlockerNG and these issues do not happen.

                                        The behaviour unbound + the dhcpleases process that restarts unbound didn't change for the last 2, 3 years or so. It's a pain, we all agree. But a pain with a "go away" button ;)

                                        If your device @home is a PC, linked up by cable, and asks for a 48 hours lease, it will renew every 24 hours. That's ok.
                                        If your device has a static IP, it will not initiate a DHCP request == unbound doesn't get restarted by "dhcpleases".
                                        It all depends on these kinds of details.
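
A rough worked example of why the renewal interval matters. It assumes, as an upper bound, that every lease renewal triggers one resolver restart while "Register DHCP Leases" is enabled. The device count of 20 is illustrative and `renewals_per_day` is a hypothetical helper.

```python
SECONDS_PER_DAY = 86_400

def renewals_per_day(renew_interval_s, devices=1):
    """Lease renewals per day for `devices` clients renewing at a fixed interval."""
    return devices * SECONDS_PER_DAY / renew_interval_s

# A wired PC with a 48-hour lease renewing every 24 hours.
print(renewals_per_day(24 * 3600))          # prints 1.0

# A gadget that renews every 7200 seconds, no matter what.
print(renewals_per_day(7200))               # prints 12.0

# Twenty such gadgets on one network (illustrative count).
print(renewals_per_day(7200, devices=20))   # prints 240.0
```

Under that assumption a single well-behaved PC costs about one restart a day, while a handful of short-interval devices can push it to hundreds.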

                                        @maverickws said in pfSense resolver stops working:

                                        I can't imagine if it were hundreds or thousands.

                                        If you need to know the host name (often pure BS like HUAWEI_P30-91b3ex3ab3c5d), that is, you want the "HUAWEI_P30-91b3ex3ab3c5d" in your DNS cache, then yeah, you have an issue.

                                        That's why I added all (the ones I need to know by host name as they have a GUI or something like that) my known home and company devices as static MAC leases.
                                        I had to enter 50+ static leases over the last ....10 years ? - and this works fine for me now.

                                        • M
                                          maverickws @Gertjan
                                          last edited by maverickws Jul 29, 2022, 10:09 AM Jul 29, 2022, 10:08 AM

                                          @gertjan thank you for your comment, but unfortunately it seems like it's a bit focused on the dhcp leases option when in truth that option had absolutely nothing to do with it.
                                          I have had it disabled for two days (just scroll up, I said when it was disabled) and the problem did not cease.

                                          The problem only ceased with the do-ip6: no option.

                                          So, despite understanding your explanation (and even agreeing that there isn't any requirement for enabling the dhcp leases, which are not enabled) it focused on something different.

                                          If your device @home is a PC, linked up by cable, and asks for a 48 hours lease, it will renew every 24 hours. That's ok.

                                          Just to give an idea, I have an office at home where me and the mrs both work, with two desks, 2 computers, a PBX and IP phones, a local server, and user devices (phones, smart wearables etc). You still have to account for smart TVs, smart vacuum cleaners, smart scales, smart "different kinds of" alarms, and others. I know we have AT LEAST 20 devices connected at any given moment, and I think I'm undercounting.

                                          The funny thing is that here at home we have absolutely no issue whatsoever.

                                          So we must understand I can compare two different segments: Segment A let's say it's home-office, and segment B is the datacenter. So let's compare:

                                          Segment A:

                                          • release 22.05;
                                          • Has WAN IPv6;
                                          • Has pfBlockerNG;
                                          • Register DHCP Leases enabled;
                                          • Has huge lists;
                                          • No issues have been registered.

                                          Segment B:

                                          • release 22.05;
                                          • Doesn't have WAN IPv6, only local;
                                          • Does NOT have ANY extra packages except Service Watchdog;
                                          • Register DHCP Leases disabled;
                                          • Does not have any kind of huge lists;
                                          • Issues occur constantly until the do-ip6: no option is added to the resolver.

                                          I agree with the static lease approach, I do it too. Just that the issue is unrelated because it's not due to that register dhcp lease option.

                                          Copyright 2025 Rubicon Communications LLC (Netgate). All rights reserved.