DNS Resolver not listening on LAN CARP VIP after update to 2.5.1

Luke_71

After updating to 2.5.1 I'm having problems with the unbound resolver, which for now I have worked around. I'm using pfBlocker in an HA setup. This setup worked fine on 2.5.0 (and still does on the HA pfSense installs I have not yet updated). The only way I can get resolution to work is by checking "Enable forwarding mode" on the DNS Resolver page, so the issue seems connected to something blocking local resolution. Localhost is not in the list of DNS servers, and the default "Use local DNS, fall back to remote DNS servers" option is selected. DNS queries from the diagnostics page resolve correctly, both locally and against upstream DNS servers, only when forwarding mode is enabled. With it disabled, lookups against localhost or the upstream servers randomly time out after about 10 seconds and only rarely resolve, which is rather odd behaviour.

If I query the CARP IP (rather than the pfSense LAN IPs), resolution fails because nothing is listening on UDP port 53. If I query the individual pfSense LAN IPs, resolution works properly, but only with DNS forwarding enabled. I ran nmap -sU, which confirmed that UDP port 53 is not listening on the CARP LAN IP. I edited and re-saved the DNS Resolver settings to make sure they picked up the CARP IP, but nothing changed. The Resolver never listens on the CARP IP.
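For anyone who wants to repeat the port 53 check without nmap, here is a minimal Python sketch that sends one hand-built DNS query over UDP and reports whether anything answers. The function names, the default test name and the 3 second timeout are my own assumptions, not anything pfSense or unbound provides.

```python
import socket
import struct


def build_dns_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record (recursion desired)."""
    # Header: ID, flags (RD bit set), 1 question, 0 answer/authority/additional.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: length-prefixed labels terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def resolver_answers(server_ip: str, name: str = "pfsense.org",
                     timeout: float = 3.0) -> bool:
    """Return True if server_ip replies to a UDP DNS query on port 53."""
    query = build_dns_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server_ip, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:
            # Timeout or a refused/unreachable port: nothing answered.
            return False
    # A genuine reply echoes the query ID in its first two bytes.
    return reply[:2] == query[:2]
```

Calling resolver_answers("192.168.0.254") and then the same against the node's own LAN IP should show the difference described above. A False result only means no reply arrived within the timeout; it does not say why.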

To summarize, the only way I can currently make resolution work is by enabling DNS forwarding mode in the DNS Resolver, and even then it works only on the pfSense local IPs, bypassing localhost. I feed these IPs to the upstream DNS servers so that everything works with pfBlocker (instead of the CARP IP, which should be used for proper failover).

If I disable DNS forwarding, little or no resolution takes place on the pfSense LAN IPs (queries time out) even though port 53 IS listening, which again indicates that something is broken on the local resolver side.
In no case does the CARP LAN IP listen on port 53, even after editing and saving the DNS Resolver settings or adding the local net to the ACL (which is of course unnecessary).

I won't update my other pfSense installs to 2.5.1 until I find out why:

1) the DNS resolver is not listening on the LAN CARP IP (192.168.0.254/24)
2) DNS resolution fails about 99% of the time, seemingly at random, without forwarding mode enabled, even though the resolver is listening on the LAN IPs.

The firewall rules, NAT and port forwards all work fine on the WAN CARP IPs. A possibly relevant note: the WAN IPs sit behind static NAT from the provider in the 10.0.0.0/24 range, but apart from having to disable the "Block Private Networks" flag on the WAN interface I have never had any issues, and I doubt this would cause any on the LAN side.

Any pointers or heads up are appreciated.

Luke_71

As an update, I at least managed to fix the non-listening LAN CARP IP by manually adding it to the Custom Options in the DNS Resolver General Settings (interface: 192.168.0.254 in my case).

However, as soon as I uncheck "Enable forwarding mode", unbound queries start failing and only seldom resolve, so something must be broken.

viktor_g

How can I reproduce it step by step?

It works fine for me on 2.5.1 and 2.6 builds:
I can successfully select the CARP VIP and receive DNS responses from it.

Luke_71

@viktor_g Thanks for your feedback. I can give you further details of my setup and the steps I took, which may help pinpoint the culprit.
Originally this was a single pfSense install (2.4.5) with pfBlocker_devel, arpwatch, nmap and openvmtools, running without issues in a VM on vSphere 6.7 (host is ESXi 6.0 U3) through a dvSwitch, with 3 VIPs. I then upgraded to 2.5.0 and subsequently 2.5.1, and added a second 2.5.1 pfSense on another host in the same cluster for HA, switching to CARP VIPs. The second install is not causing the main issue: the problem is related to the LAN CARP IP, which I created once I spawned the second pfSense by changing the "physical" LAN IP to .253 (instead of .254) and assigning .254 as the CARP VIP (the 2nd pfSense is .252). That is when the DNS problems started: resolution failed intermittently (99% of the time), while NAT and the rest of the rules worked fine on all CARP VIPs (including failover to the 2nd node). Because the failures were sporadic, I first investigated the dvSwitch and VLAN setup, but everything was OK (promiscuous mode enabled, other CARP VIPs working fine on the same VLAN), so I ruled out a network issue. As additional info, our 4 public WAN IPs are static NATed by our provider's router to the private range (10.0.0.x/24, .1 to .4), so the 4 WAN CARP IPs are also in the private range (though the issue is on the LAN side). At this stage, after several tests and reboots, I noticed 2 repeatable issues:
1) the LAN CARP VIP was never listening on port 53 (UDP/TCP), so unbound somehow wasn't "seeing" that IP to listen on, even with "ALL" interfaces selected, and
2) by default, without forwarding mode enabled, the resolver was mostly failing (99% of the time) when queried on the physical IP (.253).
I am at a loss as to what else to try. I cannot see why unbound fails direct resolution 99% of the time, or why it doesn't listen on the CARP IP by default unless I specify the interface IP manually in the custom options (though that part is easily worked around). The verbose logs don't reveal any errors. Could this be an issue inherited from the upgrade? Maybe still some CARP issue with my setup (though why would NAT and everything else work)? Any ideas are welcome - thanks again.

Luke_71

***Update: I did some failover tests and found another strange issue: when the 1st node is forced into persistent CARP maintenance mode, the second node loses its assigned primary LAN IP (!) and there are 2 null CARP VIPs. em0 looks like this when working (note the nonexistent vhid):

em0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=81009b<RXCSUM,TXCSUM,VLAN_MTU,VLAN_HWTAGGING,VLAN_HWCSUM,VLAN_HWFILTER>
        ether 00:50:56:87:dd:2f
        inet6 fe80::250:56ff:fe87:dd2f%em0 prefixlen 64 scopeid 0x1
        inet 192.168.0.252 netmask 0xffffff00 broadcast 192.168.0.255
        inet 192.168.0.254 netmask 0xffffff00 broadcast 192.168.0.255 vhid 254
        inet 10.254.254.254 netmask 0xffffffff broadcast 10.254.254.254 vhid 250
        carp: INIT vhid 10 advbase 1 advskew 100
        carp: INIT vhid 6 advbase 1 advskew 0 <-- ?? I have no VHID 6!
        carp: MASTER vhid 6 advbase 1 advskew 100
        carp: MASTER vhid 250 advbase 1 advskew 0
        media: Ethernet autoselect (1000baseT <full-duplex>)
        status: active
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>

But when I enter persistent maintenance mode, the inet 192.168.0.252 address is removed (!) and the IP becomes 0.0.0.1. I created a temporary VHID 6 on the LAN interface and then removed it; VHID 6 is now gone, but when I fail over, the .252 local IP still gets replaced with 0.0.0.1. Somehow the CARP configuration is messed up. Very strange things are going on here.
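Spotting odd entries like the stray VHID by eye gets tedious across several interfaces. Here is a rough Python sketch that parses pasted ifconfig text and flags carp lines that are in INIT state or whose VHID is not attached to any inet address. The helper names are made up for illustration, and the parser only assumes the line layout shown in the paste above.

```python
import re
from typing import List, NamedTuple, Set


class CarpEntry(NamedTuple):
    state: str
    vhid: int
    advbase: int
    advskew: int


# Matches lines such as: "carp: MASTER vhid 250 advbase 1 advskew 0"
CARP_LINE = re.compile(
    r"carp:\s+(?P<state>\w+)\s+vhid\s+(?P<vhid>\d+)"
    r"\s+advbase\s+(?P<advbase>\d+)\s+advskew\s+(?P<advskew>\d+)"
)

# Matches IPv4 lines that carry a vhid, e.g. "inet 192.168.0.254 ... vhid 254"
INET_VHID = re.compile(r"^\s*inet\s+\S+.*\bvhid\s+(\d+)", re.MULTILINE)


def parse_carp(ifconfig_text: str) -> List[CarpEntry]:
    """Extract every 'carp:' status line from ifconfig output."""
    return [
        CarpEntry(m["state"], int(m["vhid"]), int(m["advbase"]), int(m["advskew"]))
        for m in CARP_LINE.finditer(ifconfig_text)
    ]


def inet_vhids(ifconfig_text: str) -> Set[int]:
    """Collect the VHIDs that are attached to an inet address."""
    return {int(v) for v in INET_VHID.findall(ifconfig_text)}


def suspicious_entries(ifconfig_text: str) -> List[CarpEntry]:
    """Return carp entries stuck in INIT or lacking a matching inet address."""
    bound = inet_vhids(ifconfig_text)
    return [
        e for e in parse_carp(ifconfig_text)
        if e.state == "INIT" or e.vhid not in bound
    ]
```

Feed it the output of ifconfig em0 as a string; an empty list from suspicious_entries means nothing stood out under these two simple rules, not that CARP is necessarily healthy.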

Do you know the path to the CARP config file? Is it possible to edit it? I saw that there are rc.

Luke_71

***Latest update
I changed the pfBlocker interface from CARP to VIP and forced update all on both firewalls. This successfully updated and removed the CARP interface, replacing it with a standard VIP in the Virtual IPs (as per default). That in turn removed the CARP and VHID assignment on em0 which was causing the em0 main interface IP to fall back to 0.0.0.1 and lose its assigned address (while the Interfaces panel still showed it correctly). Now both firewalls fail over correctly and the second one doesn't lose its IP. Evidently something was wrong with the way pfBlocker set up its failover CARP interface on the second node. I can't say what, as I double-checked both firewalls' XML configuration files and even compared them with my other similar 2.5.0 setups and found no inconsistencies.

The only difference is that with v2.5.0 and the pfBlocker interface in CARP mode everything works, while with 2.5.1 something fails in CARP mode and the LAN interface on the second node loses its IP (!), reverting to 0.0.0.1. In standard VIP mode it works because CARP mode is avoided.

In any case one has to add "interface: <LAN CARP IP>" under the "server:" section to the Custom options in the DNS Resolver or unbound will not listen on the LAN CARP IP even if you specify it under the Network Interfaces section or select both "All" and your LAN CARP IP.
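For reference, the Custom options text described here is just a "server:" line followed by one "interface:" line per address. Below is a small Python sketch that renders it; the function name is hypothetical. It validates each address and drops duplicates, which is my own precaution so the generated text never names the same address twice.

```python
import ipaddress
from typing import Iterable


def unbound_custom_options(listen_ips: Iterable[str]) -> str:
    """Render a 'server:' block with one 'interface:' line per unique IP."""
    lines = ["server:"]
    seen = set()
    for raw in listen_ips:
        # ip_address() raises ValueError for anything that is not a valid IP.
        ip = ipaddress.ip_address(raw)
        if ip in seen:
            continue  # never emit the same listen address twice
        seen.add(ip)
        lines.append(f"interface: {ip}")
    return "\n".join(lines)


if __name__ == "__main__":
    # 192.168.0.254 is the LAN CARP IP used in this thread.
    print(unbound_custom_options(["192.168.0.254"]))
```

The printed text is what would be pasted into the Custom options box. This only builds the string; it does not touch the pfSense configuration.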

I am still experiencing a 95% DNS resolver failure rate (DNS request timeouts) unless I enable DNS Query Forwarding.

Luke_71

I tried disabling and re-enabling the CARP option in pfBlocker several times, and every time it's enabled the main IP (not the CARP VIP!) gets "lost". This is the ifconfig of em1:

[Image: CARP fault.jpg]

As you can see, the inet 0.0.0.1 should be 192.168.0.252 /24 and has NO VHID (it's not a CARP interface!) but somehow "inherits" the pfBlocker one and the pfBlocker CARP iface has the proper but same VHID (which it should). All the other CARP interfaces are fine including the .254 on the same em0.

Any thoughts? Can anyone reproduce this? Should I chuck the second pfSense and start from scratch? (I would still like to find out what is causing this very strange behaviour.)

Luke_71

@viktor_g Just wanted to let you know the problem was with the pfBlocker XMLRPC sync: it also syncs the skew value of the pfBlocker interface to the 2nd node, which it should not (the skew there should stay higher than the primary's, or 100 by default). On every complete reload/sync the CARP VIP is updated with a skew of 0, hence it crashes shortly after. I also posted this in the pfBlockerNG group for clarity.

viktor_g

@luke_71 redmine issue created:
https://redmine.pfsense.org/issues/11964

rle

I have the exact same problem! As soon as I shut down one of the CARP HA nodes, everything works as expected again. With both nodes on, the DNS resolver mostly fails. I also see random instability and "slowness or sluggishness" across my whole network.

An Ubuntu node on VLAN w/ both pfSense HA CARP running:

$ ping pfsense.org
ping: pfsense.org: Name or service not known

Same Ubuntu node, one of the pfSense HA CARP down:

$ ping pfsense.org
PING pfsense.org (208.123.73.69) 56(84) bytes of data.
64 bytes from 208.123.73.69 (208.123.73.69): icmp_seq=1 ttl=49 time=163 ms
64 bytes from 208.123.73.69 (208.123.73.69): icmp_seq=2 ttl=49 time=161 ms
64 bytes from 208.123.73.69 (208.123.73.69): icmp_seq=3 ttl=49 time=162 ms
64 bytes from 208.123.73.69 (208.123.73.69): icmp_seq=4 ttl=49 time=160 ms
^C
--- pfsense.org ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 12188ms
rtt min/avg/max/mdev = 160.537/161.794/163.045/1.094 ms
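To put a number on "mostly fails" rather than eyeballing ping, something like the Python sketch below could be run on a client with both nodes up and again with one node down. The function names are my own. It goes through the OS resolver, so any local cache on the client (an assumption worth checking) can hide failures.

```python
import socket
from typing import Callable, Iterable


def system_resolve(name: str) -> None:
    """Resolve name through the OS resolver; raises OSError on failure."""
    socket.getaddrinfo(name, None)


def failure_rate(names: Iterable[str], attempts: int = 10,
                 resolve: Callable[[str], None] = system_resolve) -> float:
    """Return the fraction of lookups that failed (0.0 when none were made)."""
    failures = 0
    total = 0
    for name in names:
        for _ in range(attempts):
            total += 1
            try:
                resolve(name)
            except OSError:
                # socket.gaierror is a subclass of OSError.
                failures += 1
    return failures / total if total else 0.0


if __name__ == "__main__":
    rate = failure_rate(["pfsense.org"], attempts=20)
    print(f"failed lookups: {rate:.0%}")
```

The resolve parameter is injectable so the counting logic can be exercised without a network; by default it calls socket.getaddrinfo.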

PS. I replicated the exact same issue on 2.6.0.a.20210526.0100

PS.2 Snort and Suricata are also "broken" on an HA CARP setup. The WAN VIP cannot be selected, so if WAN is selected in either package all network traffic grinds to a halt.

Luke_71

@rle Can you verify if you have "ALL" interfaces selected on the DNS resolver on both the Interfaces and Outgoing Interfaces? Also are you running pfBlockerNG_devel v3.0.0.16 and if so is it in CARP (not single VIP) mode?

In my case, to overcome the intermittent DNS Resolver issue, I specifically selected LAN, SYNC, <CARP GW> and LOCALHOST on the incoming interface and WAN and <CARP GW> on the Outgoing (I need the CARP GW iface also on the outgoing requests as I have some local DNS overrides to resolve machine IPs to AD DNS but this is not necessary if you have no internal DNS lookup overrides). Once I removed the pfBlocker CARP IP from the listening IPs I could successfully remove the manual "interface <CARP IP>" directive which I had added and unbound would listen on the CARP IP automagically. In fact, if I left that directive it would complain about a double specified IP, which is correct and how it should behave.

The random failing of the DNS requests is because there is confusion created between the 2 CARP enabled machines and/or the local pfBlocker CARP interface. Once the skew is properly set on the second pfBlocker IP node things should go back to normal. Check the skew on your em0 / em1 CARP interfaces using ifconfig and don't rely on the web iface as it's not consistent with runtime settings (just saved statuses) with these kinds of faults.

rle

@luke_71

DNS Resolver

Network interfaces
-- Not all interfaces, only LAN IPv4+IPv6, LAN related CARP VIP+pfB DNSBL and localhost.
Outgoing network interfaces
-- Only 'CARP WAN VIP'
Make use of 'Host Overrides' options for my internal network.

pfBlockerNG
Installed pfBlockerNG_devel v3.0.0.16. I thought I had configured it in CARP mode, however... the DNSBL VIP Type is IP Alias. Going to change that right away.

Additional notes
When pfSense 2.5.1 was released I finally switched from a single instance to pfSense HA CARP. All was well, except for fairly frequent CARP demotion error messages and some minor, weird stability/stagnation issues in my network, more often than I was used to with a single pfSense instance.

After configuring just a couple of VLANs I noticed strange behavior: within the same VLAN, every so often, a couple of computer nodes (Ubuntu) in succession would be unable to update. The frequency increased and the failures became random across all nodes.

For the past 2.5 weeks I have been researching the heck out of my network: VLAN rules, pfSense configuration, switches, NIC settings. To no avail. I knew it had to do with unbound versus CARP because 'ping' and 'nslookup' only worked randomly, but I couldn't yet rule out the rest of my network config. My non-VLAN / regular LAN interfaces were working as expected, though not as stable and fast as before pfSense 2.5.x.

Today my primary node's WAN VIP became backup again for no apparent reason, and the secondary's WAN VIP became master. I rebooted the secondary pfSense node and voila, the VLAN computer node suddenly worked after 2 to 3 seconds. That got me to this forum post.

Maybe my issue is related to something else, but I suspect not. For the first time in years my confidence in pfSense has dropped considerably. This was a roller coaster up until now.

Hardware:

2x Supermicro X10SDV-6C-TLN4F (Xeon D-1521 w/ 32 GB RAM, 256 GB NVMe, add-on card: dual SFP+ Intel X710-BM2 from FS.com).

Dedicated SYNC interface connected to each other via Cat.7A cable.

I will report back ASAP once I know more after testing, e.g. with ifconfig. Thanks for your hints! Appreciated.

Luke_71

@rle That would explain the DNS Resolver failures when both nodes were up. Once you set it to CARP mode, it automatically reconfigures the relevant iface from VIP to CARP VIP on the primary node after the reload. You have to "Force Reload" so that it also syncs to the secondary, but there is a bug: it syncs the skew of 0 over as well, which it should not. So before forcing a reload on the secondary node, go to the pfBlocker DNSBL settings there, set the skew to 100, and then force a reload on the secondary (it will apply the value to the pfBlocker CARP VIP once you run it). Beware: if you don't do that, all sorts of weird sh*t can happen on the primary iface that the pfBlocker CARP VIP is associated with (see what happened in my case). If the skew does get overwritten in the meantime, simply set it back to 100 in both the pfBlocker settings and the CARP VIP iface settings and save. If any related IP has died, you can just open the CARP settings and save again so they are rewritten, or do it manually using ifconfig. I suggest you check the runtime CARP and skew settings on the secondary node with ifconfig anyway and confirm they are valid (BACKUP -> advskew 100). None should ever be in INIT state. I bashed my head against this for nearly 2 weeks before I realized that damn skew was being overwritten on every reload, causing the BSD CARP to crap out. It kept me running around in circles for some time. For the moment, keep in mind that on every reload / config change on the primary, that skew of 0 will be written over to the secondary, so until the pfBlocker source is modified to add +100 to the adv skew on every sync to another node, it's going to fail you.
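Since the skew has to be re-checked after every sync, here is a tiny Python sketch of that comparison: given the runtime advskew per VHID on each node (read from ifconfig by hand or by a script), it lists the VHIDs where the secondary's skew is not higher than the primary's. The function name and the dict layout are assumptions made for this example.

```python
from typing import Dict, List


def skew_conflicts(primary: Dict[int, int], secondary: Dict[int, int]) -> List[int]:
    """Return the VHIDs whose secondary advskew is not above the primary's.

    Both arguments map vhid -> advskew as seen at runtime on each node.
    VHIDs that exist on only one node are ignored here.
    """
    return sorted(
        vhid
        for vhid, skew in secondary.items()
        if vhid in primary and skew <= primary[vhid]
    )


if __name__ == "__main__":
    # Example values only: VHID 250 shows the secondary's skew overwritten with 0.
    primary_node = {254: 0, 250: 0}
    secondary_node = {254: 100, 250: 0}
    print(skew_conflicts(primary_node, secondary_node))
```

An empty list means every shared VHID has a higher skew on the secondary. Any VHID in the list is a candidate for the overwritten-skew problem described above and needs the manual reset to 100.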

rle

@luke_71

I have been banging my head for the last 6 hours. So fed up to be honest.

When I enabled the CARP mode of pfBlockerNG, my complete network went crashing down the rabbit hole. Played again with various settings: especially unbound mode vs unbound python mode, resetting states, ifconfig, you name it....

My conclusion is that pfSense High Availability CARP w/ pfBlockerNG/unbound (and what about IPS/IDS?) is simply not up to par anymore nowadays. This use case is unfit for (business) production use.

I think Netgate has a very interesting dilemma and challenge with pfSense/FreeBSD and in keeping up programming/dependency wise versus up to par features.

Therefore, I'm going back to a single instance of pfSense, which has a much broader, solid, battlefield-tested base, combined with the old-fashioned strategy of a good backup and a spare node. Downtime is going to be far less prevalent than what I'm experiencing now with HA CARP.

Luke_71

@rle I understand your frustration (been there) and see what you mean. This is why I never update to the latest release on a production environment and rather do this "locally" where I can handle the disruption(s) easily. What you experienced is exactly what I have, verbatim, and I assure you it's caused by the CARP skew issue on the secondary node. If you are doing this in a production environment then avoid it and stick to a single node for now. The workaround is what I described above, but you need to manually intervene after every full sync and keep that skew difference monitored as it will bring your primary iface (thus network) down. It's a major fail yes, but easily fixable. I hope the developer will update the source ASAP.

viktor_g

This is a pfBlockerNG bug:
https://redmine.pfsense.org/issues/11964

Use the "IP Alias" VIP type or wait for the fix.

Luke_71

@viktor_g That will still cause intermittent DNS failures and there will be 2 identical active IPs on 2 different hosts. Also, the resolver won't listen on the CARP GW IP. See above.

rle

@viktor_g @Luke_71 Thanks for the feedback. In hindsight I'm going to wait it out (tired, and tired of problem solving). So I've just shut down the secondary node, pulled the SYNC cable and disabled CARP for the time being. IMHO this is panning out as an acceptable temporary solution until a fix comes along.

Will keep you posted.

viktor_g

@luke_71
Please install the System Patches package:
https://docs.netgate.com/pfsense/en/latest/development/system-patches.html

and apply Patch https://github.com/pfsense/FreeBSD-ports/pull/1071/commits/96abc00bba758dddebc09611300ac4680dc0fc5a

Then run pfBlockerNG Force restart

see https://redmine.pfsense.org/issues/11964#note-1

rle

@viktor_g Unfortunately I got some error messages.

Status update:
--> Path Strip Count must be set to 4 instead of 2 (duh). Patch applied.

Error Message