DNS Resolver causes a kernel panic reboot loop

kereno

Hi,

Has anybody experienced a kernel panic reboot loop caused by Unbound/DNS Resolver lately? I'm running the latest pfSense version 2.4.2-RELEASE-p1 (amd64).

Here's an excerpt from the log file:

<118>Starting DNS Resolver…

Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 04
fault virtual address = 0x0
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff80ea706e
stack pointer = 0x28:0xfffffe0109cdca50
frame pointer = 0x28:0xfffffe0109cdca60
code segment = base 0x0, limit 0xfffff, type 0x1b
= DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 12 (swi1: pfsync)

I have two configuration files, and one of them is causing the issue. I'm currently trying to find out what triggers this bug by applying one change at a time. I suspect unbound. Here's an excerpt from my "broken" config file:

<unbound><active_interface>opt2,opt3,opt4,opt5,lo0</active_interface>
<outgoing_interface>opt6</outgoing_interface>

<enable></enable>

<system_domain_local_zone_type>transparent</system_domain_local_zone_type>
<msgcachesize>50</msgcachesize>
<outgoing_num_tcp>10</outgoing_num_tcp>
<incoming_num_tcp>10</incoming_num_tcp>
<edns_buffer_size>4096</edns_buffer_size>
<num_queries_per_thread>512</num_queries_per_thread>
<jostle_timeout>200</jostle_timeout>
<cache_max_ttl>86400</cache_max_ttl>
<cache_min_ttl>0</cache_min_ttl>
<infra_host_ttl>900</infra_host_ttl>
<infra_cache_numhosts>10000</infra_cache_numhosts>
<unwanted_reply_threshold>disabled</unwanted_reply_threshold>
<log_verbosity>1</log_verbosity>
<regdhcpstatic></regdhcpstatic></unbound>

In the "working" config file, <msgcachesize> is at 4 and the last four options are replaced by <forwarding></forwarding>. If unbound is not the source of the problem, it has to do with DNS NAT/firewall rules, which I will need to investigate further.

If anybody has insights on this, please let me know.

kereno

It looks like the issue is related to a weird interaction between dnsmasq and unbound. I wanted to use dnsmasq to resolve lookups on a "public" network, while unbound would have been authoritative for local and VPN name resolution. This way, names not resolved within my domain would not have propagated to the root servers for resolution. My configuration was similar to the one described here: https://nguvu.org/pfsense/pfsense-baseline-setup/

However, it turns out that something is wrong with dnsmasq+unbound at reboot. At some point, pfSense gets out of the kernel panic reboot loop by itself, but it can take several minutes, and I just can't rely on this.

For now, I have decided to revert to running only unbound.

johnpoz

What is the point of his opendns setup.. Seems completely pointless… Degree of privacy from your isp? But sending everything to opendns.. So they know everything you're looking up ;)

Why would you not just resolve through the vpn? The zone is set static, so if you look up say novalidhostname.local.lan it does not get resolved.. So there is no "leak" of host names that do not exist - rolleyes ;) Or that you're using local.lan - rolleyes again, since the roots would just send back NX on that anyway.

I have mine set to static not so much for "privacy" but just to be nice.. If I have something borked looking for something.local.lan that doesn't exist, there's no reason to try to resolve it..
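To make the static vs transparent difference concrete, here is a rough Python model of how a resolver decides what to do with a name under a local zone. It is only a sketch of the behavior described above, not unbound's code: the function name `answer`, the return strings, and the sample host are made up for illustration, and it ignores record types and the NODATA case.

```python
def answer(name, zone, zone_type, local_data):
    """Simplified model of a local-zone lookup (hypothetical helper).

    local_data maps fully qualified host names to addresses.
    """
    name = name.rstrip(".").lower()
    zone = zone.rstrip(".").lower()

    # Names outside the local zone always go through normal resolution.
    in_zone = name == zone or name.endswith("." + zone)
    if not in_zone:
        return "resolve upstream"

    # Names with local data are answered locally in both zone types.
    if name in local_data:
        return "local answer " + local_data[name]

    if zone_type == "static":
        # Static: unknown names in the zone stop here, nothing leaves the box.
        return "NXDOMAIN"
    if zone_type == "transparent":
        # Transparent: unknown names fall through to normal resolution.
        return "resolve upstream"
    raise ValueError("unsupported zone type: " + zone_type)


# Assumed sample data, not taken from anyone's real config.
hosts = {"nas.local.lan": "192.168.100.10"}
print(answer("novalidhostname.local.lan", "local.lan", "static", hosts))
print(answer("novalidhostname.local.lan", "local.lan", "transparent", hosts))
```

With the zone set to static the bogus name is refused locally, while transparent (the type in the config excerpt earlier in the thread) lets it go out to whatever does the resolving.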

His is only forwarding the PTR zone to the resolver.. So pfsense cannot even resolve your own zone... And when asking for something.local.lan it's going to ask opendns..

Sorry but that guide is a mess when it comes to how dns should be set up.. Be it tinfoil hat or not... Pfsense cannot resolve your own devices in that setup, and you're leaking your own names to opendns - how is that better than your isp? And it's not in the vpn, so your isp would see the traffic anyway ;)

As to your issue - you prob did not change the listen port, so you have a race condition where unbound and dnsmasq are both trying to listen on 53..

kereno

Johnpoz, I'm not here to criticize the guy. He's probably part of this community and he's one of the rare guys willing to invest time in writing detailed tutorials.

As to the DNS port, I was forwarding it to 5353 for dnsmasq while keeping 53 for unbound, so you got it wrong there. ;) However, I came to the same conclusion as you. Dnsmasq and unbound probably get into a race condition at boot, for whatever reason, and the kernel panics. Whether my setup is proper or not, I think this is called a bug. ;)

johnpoz

If he was part of the community he would be posting his stuff here and be up for review, etc.

No, he is just some guy who wants to draw traffic to his site because of the popularity of pfsense..

Let's see your logs where your services come up on the different ports 5353 and 53… If the services were not trying to use the same port then there wouldn't be a race condition. Do you have them both trying to register dhcp?

You can for sure run both as long as they do not conflict with each other trying to do something or use the same ports. But there is really no reason to do such a thing in his scenario of a setup.

kereno

Johnpoz, don't be too hard on him. ;) We all share this same passion for pfSense. :)

As to my logs, they are unfortunately long gone since I had to do a clean install to get out of this kernel reboot loop issue.

Regarding DHCP, I had a domain override for 100.168.192.in-addr.arpa in dnsmasq, with IP address 192.168.100.1 as the authoritative DNS server for this domain. This domain referred to the only VLAN that had been selected in dnsmasq's network interfaces (let's say VLAN100 for the example). In unbound, VLAN100 had not been selected in the network interfaces. All VLANs had their own subnet, each served by the DHCP server. Both unbound and dnsmasq had their own independent outgoing network interfaces (gateways).

I don't see how unbound and dnsmasq would get into a race condition (not just a port-related one), unless it has to do with the code. I might give this configuration another shot in future pfSense revisions.

johnpoz

There could be a race condition if they both want to bind to the same port.. Do you have a vpn in play? Yes you do, going by that whole setup there is a vpn at work.. So see here:

https://redmine.pfsense.org/issues/6186

Maybe your vpn was not coming up, etc. My point about the race condition is that during boot, if for whatever reason something takes time A instead of time B, or two things want to listen on port X, then depending on the boot order things could happen.. This is a race condition.. In scenario A you're fine and things work, but if B happens then you're broke.. It's just a race each time to see who wins, etc.
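The "who wins" part can be shown with a small Python sketch. It assumes two hypothetical services that both want the same UDP port on loopback; the names are only labels, and `try_bind` and `start_in_order` are helpers invented for this example. It illustrates port contention only: a failed bind is an ordinary user-space error and does not by itself explain a kernel panic.

```python
import errno
import socket


def try_bind(port, host="127.0.0.1"):
    """Try to bind a UDP socket; return (sock, None) or (None, errno)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.bind((host, port))
        return sock, None
    except OSError as exc:
        sock.close()
        return None, exc.errno


def start_in_order(names, port=0, host="127.0.0.1"):
    """Start the named services in order, all contending for one port.

    With port=0 the first service gets an ephemeral port from the OS,
    and every later service then asks for that same port.
    """
    results = {}
    held = []
    try:
        for name in names:
            sock, err = try_bind(port, host)
            if sock is not None:
                held.append(sock)
                port = sock.getsockname()[1]
                results[name] = "listening"
            elif err == errno.EADDRINUSE:
                results[name] = "address in use"
            else:
                results[name] = "failed: errno %s" % err
    finally:
        # Release the port so the sketch can be run repeatedly.
        for sock in held:
            sock.close()
    return results


# Same two services, different start order, different winner.
print(start_in_order(["unbound", "dnsmasq"]))
print(start_in_order(["dnsmasq", "unbound"]))
```

Whichever service starts first holds the port and the other one is refused, so the outcome depends purely on boot timing.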

kereno

You got it there, mate. There's definitely a race condition between the VPN and the DNS services at boot. :-\ When one of the two DNS services is silenced, everything is fine. Once the race condition happens and I let the system reboot several times, it gets out of the loop after some time.

I have coded in assembly for several years and you cannot let this happen, ever. Process priorities need to be taken care of, otherwise everything breaks and it's a mess to troubleshoot. That's why low-level IRQs have always had different priorities. In higher-level coding, these basic rules are sometimes left behind in favor of faster deployment. This is definitely something that must be worked out in future pfSense versions.

Besides that, I have to admit that pfsense kicks arse! ;)