    Netgate Discussion Forum

    DNS Resolver causes a kernel panic reboot loop

    DHCP and DNS
    • K
      kereno last edited by

      Hi,

      Has anybody experienced a kernel panic reboot loop caused by Unbound/DNS Resolver lately? I'm running the latest pfSense version 2.4.2-RELEASE-p1 (amd64).

      Here's an excerpt from the log file:

      <118>Starting DNS Resolver…

      Fatal trap 12: page fault while in kernel mode
      cpuid = 2; apic id = 04
      fault virtual address  = 0x0
      fault code      = supervisor read data, page not present
      instruction pointer  = 0x20:0xffffffff80ea706e
      stack pointer          = 0x28:0xfffffe0109cdca50
      frame pointer          = 0x28:0xfffffe0109cdca60
      code segment      = base 0x0, limit 0xfffff, type 0x1b
              = DPL 0, pres 1, long 1, def32 0, gran 1
      processor eflags  = interrupt enabled, resume, IOPL = 0
      current process      = 12 (swi1: pfsync)

      I have two configuration files, and one of them causes the issue. I'm currently trying to find the cause of this bug by applying one change at a time. I suspect unbound. Here's an excerpt from my "broken" config file:

      <unbound>
      <active_interface>opt2,opt3,opt4,opt5,lo0</active_interface>
      <outgoing_interface>opt6</outgoing_interface>
      <enable></enable>
      <system_domain_local_zone_type>transparent</system_domain_local_zone_type>
      <msgcachesize>50</msgcachesize>
      <outgoing_num_tcp>10</outgoing_num_tcp>
      <incoming_num_tcp>10</incoming_num_tcp>
      <edns_buffer_size>4096</edns_buffer_size>
      <num_queries_per_thread>512</num_queries_per_thread>
      <jostle_timeout>200</jostle_timeout>
      <cache_max_ttl>86400</cache_max_ttl>
      <cache_min_ttl>0</cache_min_ttl>
      <infra_host_ttl>900</infra_host_ttl>
      <infra_cache_numhosts>10000</infra_cache_numhosts>
      <unwanted_reply_threshold>disabled</unwanted_reply_threshold>
      <log_verbosity>1</log_verbosity>
      <regdhcpstatic></regdhcpstatic>
      </unbound>

      In the "working" config file, <msgcachesize> is set to 4 and the last four options are replaced by <forwarding></forwarding>. If unbound is not the source of the problem, then it has to do with the DNS NAT/firewall rules, which I will need to investigate further.

      If anybody has insights on this, please let me know.
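
When comparing a "broken" and a "working" config like the two described above, a small script can list exactly which settings differ, which narrows down what to change one step at a time. The following is only a sketch: the two XML strings are trimmed-down, hypothetical stand-ins for the real <unbound> sections, and the helper names `settings` and `diff_settings` are made up for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down copies of the two <unbound> sections.
BROKEN = """<unbound>
  <msgcachesize>50</msgcachesize>
  <log_verbosity>1</log_verbosity>
  <regdhcpstatic></regdhcpstatic>
</unbound>"""

WORKING = """<unbound>
  <msgcachesize>4</msgcachesize>
  <forwarding></forwarding>
</unbound>"""


def settings(xml_text):
    """Map each child tag of the root element to its text ('' when empty)."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in root}


def diff_settings(a, b):
    """Return (only_in_a, only_in_b, changed) for two settings dicts."""
    only_a = sorted(set(a) - set(b))
    only_b = sorted(set(b) - set(a))
    changed = {k: (a[k], b[k]) for k in sorted(set(a) & set(b)) if a[k] != b[k]}
    return only_a, only_b, changed


only_broken, only_working, changed = diff_settings(settings(BROKEN), settings(WORKING))
print("only in broken: ", only_broken)
print("only in working:", only_working)
print("changed:        ", changed)
```

Each entry in the three results is one candidate change to apply and test on its own.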
      • K
        kereno last edited by

        It looks like the issue is related to a weird interaction between dnsmasq and unbound. I wanted to use dnsmasq to resolve lookups on a "public" network, while unbound would act as the authoritative server for local and VPN name resolution. This way, names not resolved within my domain would not be passed on to the root servers for resolution. My configuration was similar to the one described here: https://nguvu.org/pfsense/pfsense-baseline-setup/

        However, it turns out that something is wrong with dnsmasq+unbound at reboot. At some point, pfSense gets out of the kernel panic reboot loop by itself, but it can take several minutes, and I just can't rely on this.

        For now, I have decided to go back to running only unbound.

        • johnpoz
          johnpoz LAYER 8 Global Moderator last edited by

          What is the point of his OpenDNS setup.. Seems completely pointless… Degree of privacy from your ISP?  But sending everything to OpenDNS.. So they know everything you're looking up ;)

          Why would you not just resolve through the VPN?  The zone is set to static, so if you look up, say, novalidhostname.local.lan, it does not get resolved.. So there is no "leak" of host names that do not exist - rolleyes ;)  Or of the fact that you're using local.lan - rolleyes again - which the roots would just send back NX on anyway.

          I have mine set to static not so much for "privacy" but just to be nice.. If I have something borked looking for something.local.lan that doesn't exist, there is no reason to try to resolve it..

          His is only forwarding the PTR zone to the resolver.. So pfSense cannot even resolve your own zone... And when asking for something.local.lan it's going to ask OpenDNS..

          Sorry, but that guide is a mess when it comes to how DNS should be set up.. Be it tinfoil hat or not... pfSense cannot resolve your own devices in that setup, and you're leaking your own names to OpenDNS - how is that better than your ISP?  And it's not in the VPN, so your ISP would see the traffic anyway ;)

          As to your issue - you probably did not change the listen port, so you have a race condition where unbound and dnsmasq are both trying to listen on 53..
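
To show what "both trying to listen on the same port" looks like at the socket level, here is a minimal sketch. It does not touch unbound or dnsmasq at all: it binds two plain UDP sockets to the same loopback address and port, using an OS-assigned port instead of 53 so it runs without root. The helper name `try_bind` is made up for this example.

```python
import errno
import socket


def try_bind(port, host="127.0.0.1"):
    """Bind a UDP socket to host:port; return the socket, or None if the port is taken."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind((host, port))
    except OSError as exc:
        sock.close()
        if exc.errno == errno.EADDRINUSE:
            return None
        raise
    return sock


first = try_bind(0)                # port 0: let the OS pick a free port
port = first.getsockname()[1]      # the port the first socket actually got
second = try_bind(port)            # same address and port as the first socket
print("second bind succeeded:", second is not None)
first.close()
```

Whichever service binds first keeps the port; the other one gets an "address already in use" error unless it is configured to listen elsewhere.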

          An intelligent man is sometimes forced to be drunk to spend time with his fools
          If you get confused: Listen to the Music Play
          Please don't Chat/PM me for help, unless mod related
          SG-4860 23.01 | Lab VMs CE 2.6, 2.7

          • K
            kereno last edited by

            Johnpoz, I'm not here to criticize the guy. He's probably part of this community, and he's one of the rare guys willing to invest time in writing detailed tutorials.

            As to the DNS port, I was forwarding it to 5353 for dnsmasq while keeping 53 for unbound, so you got it wrong there. ;) However, I came to the same conclusion as you. Dnsmasq and unbound probably get into a race condition at boot, for whatever reason, and the kernel panics. Whether my setup is a proper one or not, I think this is called a bug. ;)

            • johnpoz
              johnpoz LAYER 8 Global Moderator last edited by

              If he was part of the community he would be posting his stuff here and be up for review, etc.

              No, he is just some guy that wants to draw traffic to his site because of the popularity of pfSense..

              Let's see your logs where your services come up on the different ports, 5353 and 53… If the services were not trying to use the same port then there wouldn't be a race condition.  Do you have them both trying to register DHCP?

              You can for sure run both as long as they do not conflict with each other trying to do something or use the same ports. But there is really no reason to do such a thing in his scenario of a setup.

              • K
                kereno last edited by

                Johnpoz, don't be too hard on him. ;) We all share this same passion for pfSense. :)

                As to my logs, they are unfortunately long gone since I had to do a clean install to get out of this kernel reboot loop issue.

                Regarding DHCP, I had a domain override for 100.168.192.in-addr.arpa in dnsmasq, with IP address 192.168.100.1 as the authoritative DNS server for this domain. This domain referred to the only VLAN that had been selected in dnsmasq's network interfaces (let's say VLAN100 for the example). In unbound, VLAN100 had not been selected in the network interfaces. All VLANs had their own subnet, each served by the DHCP server. Unbound and dnsmasq each had their own independent outgoing network interface (gateway).
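
For anyone unsure how the override name above relates to the subnet: the reverse zone is the network's octets in reverse order followed by in-addr.arpa. A short sketch using Python's standard `ipaddress` module, assuming a /24 network; the helper name `reverse_zone_24` is made up for this example.

```python
import ipaddress


def reverse_zone_24(address):
    """Return the in-addr.arpa zone for the /24 that contains an IPv4 address."""
    # reverse_pointer yields the full PTR name, e.g. 1.100.168.192.in-addr.arpa
    ptr = ipaddress.ip_address(address).reverse_pointer
    # Drop the first label (the host octet) to get the zone for the /24.
    return ptr.split(".", 1)[1]


print(reverse_zone_24("192.168.100.1"))
```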

                I don't see how unbound and dnsmasq would get into a race condition (and not just a port-related one), unless it has to do with the code. I might give this configuration another shot in a future pfSense release.

                • johnpoz
                  johnpoz LAYER 8 Global Moderator last edited by

                  There could be a race condition if they both want to bind to the same port..  Do you have a VPN in play - yes you do, from the whole setup there is a VPN at work..  So here

                  https://redmine.pfsense.org/issues/6186

                  Maybe your VPN was not coming up, etc.  My point about the race condition is that during boot, if for whatever reason something takes time A instead of time B, or two things want to listen on port X, then depending on the order of boot things could happen.. This is a race condition.. In scenario A you're fine and things work, but if B happens then you're broken.. It's just a race each time to see who wins, etc.
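
The scenario A versus scenario B idea can be sketched as a tiny simulation. This is purely illustrative: the service names, ready times, and the assumption that both services want port 53 are hypothetical inputs, and `boot` is a made-up function, not anything pfSense runs.

```python
def boot(ready_times, port_of):
    """Simulate services claiming ports in the order they become ready.

    ready_times: service name -> seconds until it tries to bind (hypothetical)
    port_of:     service name -> port it wants
    Returns service name -> True if it got its port, False if the port was taken.
    """
    owner = {}   # port -> service that claimed it first
    result = {}
    for name in sorted(ready_times, key=ready_times.get):
        port = port_of[name]
        result[name] = port not in owner
        owner.setdefault(port, name)
    return result


ports = {"unbound": 53, "dnsmasq": 53}
print(boot({"unbound": 1.0, "dnsmasq": 2.0}, ports))  # scenario A
print(boot({"unbound": 3.0, "dnsmasq": 2.0}, ports))  # scenario B
```

The only thing that changes between the two calls is timing, yet a different service ends up without its port. With distinct ports (for example 53 and 5353) both would succeed regardless of order.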

                  • K
                    kereno last edited by

                    You got it there, mate. There's definitely a race condition between the VPN and the DNS services at boot.  :-\ When one of the two DNS services is disabled, everything is fine. Once the race condition happens and I let the system reboot several times, it gets out of the loop after some time.

                    I have coded in assembly for several years, and you cannot let this happen, ever. Process priorities need to be taken care of, otherwise everything breaks and it's a mess to troubleshoot. That's why low-level IRQs have always had different priorities. In higher-level coding, these basic rules are sometimes left behind in favor of faster deployment. This is definitely something that must be worked out in future pfSense versions.

                    Besides that, I have to admit that pfsense kicks arse!  ;)
