DNS Resolver not listening on LAN CARP VIP after update to 2.5.1
-
@luke_71 said in DNS Resolver not listening on LAN CARP VIP after update to 2.5.1:
I'm having problems after the 2.5.1 update with the unbound resolver and at the moment I have worked around them: in my case I am using pfBlocker and an HA setup. This setup was working fine on 2.5.0 (and still is for the pfSense installs in HA I have not yet updated). The only way I can get resolution to work is by checking the "Enable forwarding mode" in the DNS Resolver page so the issue seems connected to something blocking resolution locally. I have no localhost added among the populated list of DNS servers and the "Use local DNS, fall back to remote DNS servers" default option selected. DNS queries from the diagnostic page resolve correctly locally and with upstream DNS servers only when forwarding mode is enabled. If it is not enabled the resolution times out randomly on localhost or upstream servers around 10 seconds and only rarely resolves - a rather awkward behaviour.
If I connect to the CARP IP (not the pfSense LAN IPs) resolution fails (port 53 UDP is not listening). If I connect to the individual pfSense LAN IPs resolution works properly only with DNS forwarding enabled. I ran an nmap -sU and in fact UDP port 53 is not listening on the CARP LAN IP. I edited and saved again the DNS Resolver settings just to be sure it updated with the CARP IP but nothing changed. The Resolver never listens on the CARP IP.
Just to synthesize, the only way I currently am able to make resolution work is by enabling DNS Forwarding mode on the DNS Resolver which allows DNS resolution to work properly only on the pfSense local IPs bypassing the localhost. I feed these IPs to the upstream DNS servers for everything to work with pfBlocker (instead of the CARP IP which should be used instead for proper failover).
If I disable DNS forwarding, little or no resolution takes place on the LAN pfSense IPs (times out) while port 53 IS however listening, indicating again that something is broken on the local resolver side.
In no case is the CARP LAN IP listening on port 53, even after editing and saving the DNS Resolver settings or adding the local NET to the ACL (which is of course unnecessary). I won't update my other pfSense installs to 2.5.1 until I can find what is wrong and why:
- the DNS resolver is not listening on the LAN CARP IP (192.168.0.254/24)
- DNS resolution (randomly with a 99% chance) fails without Forwarding mode enabled even if listening on LAN IPs.
The rest of the firewall NAT and port forward rules work fine on the other WAN CARP IPs and so does NAT. An additional and maybe relevant note is that the WAN IPs are under Static NAT from the provider on the 10.0.0.0/24 range but apart from obviously disabling the "Block Private Networks" flag in the WAN interface I have never had any issues and doubt this should create any on the LAN side.
Any pointers or heads up are appreciated.
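The check described above (does UDP port 53 answer on the CARP VIP versus the individual node IPs?) can also be done without nmap. The following is only a minimal sketch: the function names are made up for illustration, and it sends a single hand-built A query and waits for any reply with a matching ID.

```python
import socket
import struct


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record, with recursion desired."""
    # Header: ID, flags (RD bit set), 1 question, 0 answer/authority/additional.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE = 1 (A), QCLASS = 1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def answers_dns(server: str, name: str = "pfsense.org", timeout: float = 3.0) -> bool:
    """Return True if `server` replies over UDP port 53 with a matching query ID."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:
            # Covers timeouts as well as ICMP "port unreachable" errors.
            return False
    return len(reply) >= 12 and reply[:2] == query[:2]
```

Calling `answers_dns("192.168.0.254")` for the CARP VIP and then again for each node's own LAN IP from a LAN client would show which addresses the resolver actually answers on. A `False` result only means that no reply arrived within the timeout; it does not say why.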
I have the exact same problem! As soon as I shut down one of the CARP HA nodes, everything works as expected again. With both nodes on, the DNS resolver mostly fails. I also see random instability and general slowness or sluggishness across my whole network.
An Ubuntu node on VLAN w/ both pfSense HA CARP running:
$ ping pfsense.org
ping: pfsense.org: Name or service not known
Same Ubuntu node, one of the pfSense HA CARP down:
$ ping pfsense.org
PING pfsense.org (208.123.73.69) 56(84) bytes of data.
64 bytes from 208.123.73.69 (208.123.73.69): icmp_seq=1 ttl=49 time=163 ms
64 bytes from 208.123.73.69 (208.123.73.69): icmp_seq=2 ttl=49 time=161 ms
64 bytes from 208.123.73.69 (208.123.73.69): icmp_seq=3 ttl=49 time=162 ms
64 bytes from 208.123.73.69 (208.123.73.69): icmp_seq=4 ttl=49 time=160 ms
^C
--- pfsense.org ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 12188ms
rtt min/avg/max/mdev = 160.537/161.794/163.045/1.094 ms
PS. I replicated the exact same issue on 2.6.0.a.20210526.0100.
PS.2 Snort and Suricata are also "broken" on an HA CARP setup. The WAN VIP cannot be selected, so if WAN is selected in either package all network traffic grinds to a halt.
-
@rle Can you verify if you have "ALL" interfaces selected on the DNS resolver on both the Interfaces and Outgoing Interfaces? Also are you running pfBlockerNG_devel v3.0.0.16 and if so is it in CARP (not single VIP) mode?
In my case, to overcome the intermittent DNS Resolver issue, I specifically selected LAN, SYNC, <CARP GW> and LOCALHOST on the incoming interface and WAN and <CARP GW> on the Outgoing (I need the CARP GW iface also on the outgoing requests as I have some local DNS overrides to resolve machine IPs to AD DNS but this is not necessary if you have no internal DNS lookup overrides). Once I removed the pfBlocker CARP IP from the listening IPs I could successfully remove the manual "interface <CARP IP>" directive which I had added and unbound would listen on the CARP IP automagically. In fact, if I left that directive it would complain about a double specified IP, which is correct and how it should behave.
The random failing of the DNS requests is because there is confusion created between the 2 CARP enabled machines and/or the local pfBlocker CARP interface. Once the skew is properly set on the second pfBlocker IP node things should go back to normal. Check the skew on your em0 / em1 CARP interfaces using ifconfig and don't rely on the web iface as it's not consistent with runtime settings (just saved statuses) with these kinds of faults.
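As a rough illustration of that ifconfig check, the sketch below parses `carp:` lines and flags the two conditions discussed in this thread: a VHID in INIT state, and a secondary node advertising advskew 0. The sample text is invented for the example, and the line format (`carp: STATE vhid N advbase N advskew N`) is assumed to match what FreeBSD's ifconfig prints; adjust the pattern if your output differs.

```python
import re

# Assumed format of a CARP line in FreeBSD ifconfig output.
CARP_LINE = re.compile(
    r"carp:\s+(?P<state>\w+)\s+vhid\s+(?P<vhid>\d+)"
    r"\s+advbase\s+(?P<advbase>\d+)\s+advskew\s+(?P<advskew>\d+)"
)


def carp_entries(ifconfig_output: str) -> list:
    """Collect one dict per 'carp:' line, tagged with the interface it belongs to."""
    entries = []
    iface = None
    for line in ifconfig_output.splitlines():
        # Interface headers start in column 0, e.g. "em1: flags=...".
        if line and not line[0].isspace():
            iface = line.split(":", 1)[0]
        match = CARP_LINE.search(line)
        if match:
            entries.append({
                "iface": iface,
                "state": match["state"],
                "vhid": int(match["vhid"]),
                "advskew": int(match["advskew"]),
            })
    return entries


def carp_problems(entries: list, is_secondary: bool) -> list:
    """Return human-readable warnings for the conditions described in the thread."""
    issues = []
    for e in entries:
        if e["state"] == "INIT":
            issues.append(f"{e['iface']} vhid {e['vhid']}: state is INIT")
        if is_secondary and e["advskew"] == 0:
            issues.append(f"{e['iface']} vhid {e['vhid']}: advskew 0 on secondary")
    return issues


# Invented sample output, for illustration only.
SAMPLE = """\
em1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
\tinet 192.168.0.2 netmask 0xffffff00 broadcast 192.168.0.255
\tcarp: BACKUP vhid 1 advbase 1 advskew 0
\tcarp: INIT vhid 2 advbase 1 advskew 100
"""

for warning in carp_problems(carp_entries(SAMPLE), is_secondary=True):
    print(warning)
```

On a real node you would feed it the text of `ifconfig` captured on the secondary rather than the sample string.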
-
DNS Resolver
- Network interfaces: not all interfaces, only LAN IPv4+IPv6, the LAN-related CARP VIP + pfB DNSBL, and localhost.
- Outgoing network interfaces: only 'CARP WAN VIP'.
- I make use of the 'Host Overrides' options for my internal network.
pfBlockerNG
Installed pfBlockerNG_devel v3.0.0.16. I thought I had configured it in CARP mode, however... the DNSBL VIP Type is IP Alias. Going to change that right away.
Additional notes
When pfSense 2.5.1 was released I finally switched over to pfSense HA CARP, coming from a single instance. All was well, except that more than once in a while I got CARP demotion error messages and some minor weird stability/stagnation issues in my network, more often than I was used to with a single instance of pfSense.
After configuring just a couple of VLANs I noticed strange behavior. Within the same VLAN, every so often, a couple of computer nodes (Ubuntu) in succession would not be able to update. The frequency went up and the failures became random across all nodes.
About 2.5 weeks ago I began researching the heck out of my network: VLAN rules, pfSense configuration, switches, NIC settings. To no avail. I knew it had to do with unbound versus CARP because 'ping' and 'nslookup' only worked randomly, but I couldn't yet rule out the rest of my network config. My non-VLAN / regular LAN interfaces were working as expected, though not as stable and fast as before pfSense 2.5.x.
Today my primary node WAN VIP became backup again with no clue whatsoever and the secondary WAN VIP the master. I rebooted the secondary pfSense node and voila the VLAN computer node suddenly worked after 2 to 3 seconds. That got me to this forum post.
Maybe my issue is related to something else, but I suspect not. For the first time in years my confidence in pfSense has dropped considerably. This was a roller coaster up until now.
Hardware:
- 2x Supermicro X10SDV-6C-TLN4F (Xeon D-1521 w/ 32 GB RAM, 256 GB NVMe, add-on card dual SFP+ Intel X710-BM2 from FS.com).
Dedicated SYNC interface, with the two nodes connected to each other via a Cat.7A cable.
Will report back ASAP once I know more after testing, e.g. with ifconfig. Thanks for your hints! Appreciated.
-
-
@rle That would explain the DNS Resolver failures when both nodes were up. Once you set it to CARP mode, it will automatically change the relevant iface from VIP to CARP VIP on the primary node after the reload. You have to "Force Reload" so it also syncs to the secondary, but there is a bug: it syncs the 0 skew setting over, which it should not. So before forcing a reload on the secondary node, go to the pfBlocker DNSBL settings and set that skew to 100, then force a reload on the secondary (it will apply it to the pfBlocker CARP VIP once you run it).
Beware: if you don't do that, all sorts of weird sh*t can happen on the primary iface the pfBlocker CARP VIP is associated with (check what happened in my case). If the value does get overwritten in the meantime, simply set it back to 100 in both the pfBlocker settings and the CARP VIP iface skew, and save. If any related IP has died, you can just open the CARP settings and save again so they are rewritten, or do it manually using ifconfig. In any case, I suggest you check the runtime CARP and skew settings on the secondary node with ifconfig and confirm they are valid (Backup -> 100 advskew). None should ever be in INIT state.
I had been bashing my head against this for nearly 2 weeks before I realized that the damn skew was being overwritten on every reload, causing the BSD CARP to crap out. It kept me running around in circles for some time. For the moment, just keep in mind that on every reload / config change on the primary, that 0 skew value will be written over to the secondary. Until the pfBlocker source is modified to add +100 to the adv skew on every sync of that value to another node, it's going to fail you.
-
I have been banging my head for the last 6 hours. So fed up to be honest.
When I enabled the CARP mode of pfBlockerNG, my complete network went crashing down the rabbit hole. Played again with various settings: especially unbound mode vs unbound python mode, resetting states, ifconfig, you name it....
My conclusion is that pfSense High Availability CARP w/ pfBlockerNG/unbound (and what about IPS/IDS?) is simply not up to par anymore nowadays. This use case is unfit for (business) production use.
I think Netgate has a very interesting dilemma and challenge with pfSense/FreeBSD and in keeping up programming/dependency wise versus up to par features.
Therefore, I'm going back to a single instance of pfSense with a much broader solid battle field tested base in combination with an old fashioned strategy of a good backup with a spare node. Downtime is going to be far less prevalent than what I'm experiencing now with HA CARP.
-
@rle I understand your frustration (been there) and see what you mean. This is why I never update to the latest release on a production environment and rather do this "locally" where I can handle the disruption(s) easily. What you experienced is exactly what I have, verbatim, and I assure you it's caused by the CARP skew issue on the secondary node. If you are doing this in a production environment then avoid it and stick to a single node for now. The workaround is what I described above, but you need to manually intervene after every full sync and keep that skew difference monitored as it will bring your primary iface (thus network) down. It's a major fail yes, but easily fixable. I hope the developer will update the source ASAP.
-
This is a pfBlockerNG bug:
https://redmine.pfsense.org/issues/11964
Use the "IP Alias" VIP type or wait for the fix.
-
@viktor_g That will still cause intermittent DNS failures and there will be 2 identical active IPs on 2 different hosts. Also, the resolver won't listen on the CARP GW IP. See above.
-
@viktor_g @Luke_71 Thanks for the feedback. On reflection I'm going to wait it out (tired, and tired of problem solving). So I've just shut down the secondary node, pulled the SYNC cable and disabled CARP for the time being. IMHO this is panning out as an acceptable temporary solution until a fix comes along.
Will keep you posted.
-
@luke_71
Please install the System Patches package:
https://docs.netgate.com/pfsense/en/latest/development/system-patches.html
and apply Patch https://github.com/pfsense/FreeBSD-ports/pull/1071/commits/96abc00bba758dddebc09611300ac4680dc0fc5a
Then run pfBlockerNG Force restart
-
@viktor_g Unfortunately I got some error messages.
Status update:
--> Path Strip Count must be set to 4 instead of 2 (duh). Patch applied.
Error Message
-
If I apply the CARP mode to pfBlockerNG I get:
Status / System Logs / System / General
May 28 01:46:44 dhcpleases 63735 Could not deliver signal HUP to process because its pidfile (/var/run/unbound.pid) cannot be read, No such file or directory.
May 28 01:46:44 dhcpleases 63735 Could not deliver signal HUP to process because its pidfile (/var/run/unbound.pid) cannot be read, No such file or directory.
May 28 01:46:29 dhcpleases 63735 Could not deliver signal HUP to process because its pidfile (/var/run/unbound.pid) cannot be read, No such file or directory.
May 28 01:46:29 dhcpleases 63735 Could not deliver signal HUP to process because its pidfile (/var/run/unbound.pid) cannot be read, No such file or directory.
May 28 01:46:27 dhcpleases 63735 Could not deliver signal HUP to process because its pidfile (/var/run/unbound.pid) cannot be read, No such file or directory.
May 28 01:46:20 dhcpleases 36267 Could not deliver signal HUP to process because its pidfile (/var/run/unbound.pid) cannot be read, No such file or directory.
and Unbound + pfb_dnsbl service will not start at all regardless of DNSBL Mode.
Only the DNSBL VIP Type = IP Alias works for me.
In other words: I cannot properly test the patch unfortunately.
-
@viktor_g I confirm the patch works properly, the skew is no longer overwritten with a force reload on both nodes and if the resulting added value (100) is over 254 it reverts to max 254.
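The behavior I observed can be summarized in a few lines. This is only a sketch of what I see from the outside (add 100 to the synced skew, cap at 254), not the code from the patch; the function name and constant are made up for illustration.

```python
MAX_ADVSKEW = 254  # cap observed after applying the patch


def secondary_advskew(primary_advskew: int, offset: int = 100) -> int:
    """Skew the secondary node should end up with, given the primary's skew."""
    if not 0 <= primary_advskew <= MAX_ADVSKEW:
        raise ValueError("advskew must be between 0 and 254")
    # Add the offset, but never exceed the maximum.
    return min(primary_advskew + offset, MAX_ADVSKEW)


print(secondary_advskew(0))    # primary at 0, so the secondary gets 100
print(secondary_advskew(200))  # 300 would exceed the cap, so 254
```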
One additional observation: per the pfSense HA CARP guide, the CARP VIPs should have the same subnet as the main interface:
https://docs.netgate.com/pfsense/en/latest/troubleshooting/high-availability.html
Incorrect Subnet Mask
The real subnet mask must be used for a CARP VIP, not /32. This must match the subnet mask for the IP address on the interface to which the CARP IP is assigned.
I am certain that /32 is OK for pfBlocker on "local" ifaces, but based on the above, shouldn't the subnet match that of the assigned iface rather than /32 when pfBlocker is configured in CARP VIP mode, or am I reading this incorrectly?
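To make the rule quoted from the guide concrete, here is a small sketch using Python's standard `ipaddress` module. It only encodes that rule (same prefix length as the interface, and the VIP inside the interface's network); whether pfBlocker's /32 is acceptable in CARP mode remains the open question. The function name and the 192.168.0.1 interface address are assumptions for the example.

```python
import ipaddress


def vip_mask_problems(interface_cidr: str, vip_cidr: str) -> list:
    """Compare a CARP VIP against the interface address it is assigned to."""
    iface = ipaddress.ip_interface(interface_cidr)
    vip = ipaddress.ip_interface(vip_cidr)
    issues = []
    # The guide says the VIP must use the interface's real mask, not /32.
    if vip.network.prefixlen != iface.network.prefixlen:
        issues.append(
            f"VIP mask /{vip.network.prefixlen} differs from "
            f"interface mask /{iface.network.prefixlen}"
        )
    # The VIP address should also fall inside the interface's subnet.
    if vip.ip not in iface.network:
        issues.append(f"VIP {vip.ip} is outside {iface.network}")
    return issues


# 192.168.0.254/24 is the LAN CARP VIP from the first post; 192.168.0.1/24 is hypothetical.
print(vip_mask_problems("192.168.0.1/24", "192.168.0.254/24"))
print(vip_mask_problems("192.168.0.1/24", "192.168.0.254/32"))
```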
Thanks for the patch.
-
@rle Can you check the logs and see what the issues are with unbound? I only select LAN, localhost and CARP VIP for the listening ifaces (don't select ALL) and WAN on the outbound interface (plus the LAN CARP VIP for local domain overrides). After changing to CARP mode in DNSBL, remember to run a Force Reload of All in pfBlocker Update, first on the primary and then on the secondary.
-
@luke_71 @viktor_g It seems that I had a (huge) misconfiguration with unbound. My knowledge is not up to par...
Apologies for my rant a couple of posts back. Can't seem to change/edit it however.
Only issue now is that pfb_dnsbl/pfBlockerNG DNSBL service is not starting at all. CARP issues are gone.
Huge thanks to both of you for your help and quick release of the patch!
Tested it on pfSense 2.6.0.a.20210527.0100
-
@rle I have no issues with pfBlockerNG but I'm on 2.5.1 / 3.0.0_16 + patch. I can only suggest you check the logs after having run a full reload on both nodes. Be sure that the unbound service is running without issues and that the DNSBL webserver config has no conflicting ports on the LAN interface.