DNS Resolver not listening on LAN CARP VIP after update to 2.5.1
-
DNS Resolver
-
Network interfaces
-- Not all interfaces, only LAN IPv4+IPv6, the LAN-related CARP VIP + pfB DNSBL, and localhost.
Outgoing network interfaces
-- Only 'CARP WAN VIP'
Make use of 'Host Overrides' options for my internal network.
pfBlockerNG
Installed pfBlockerNG_devel v3.0.0.16. I thought I had configured it in CARP mode, however... the DNSBL VIP Type is IP Alias. Going to change that right away.
Additional notes
When pfSense 2.5.1 was released I finally switched over to pfSense HA CARP, coming from a single instance. All was well, except that CARP demotion error messages and some minor, weird stability/stagnation issues occurred in my network more than once in a while. More often than what I was used to when using only a single instance of pfSense.
After configuring just a couple of VLANs I noticed strange behavior. Within the same VLAN, every so often, a couple of computer nodes (Ubuntu) in succession would not be able to update. The frequency went up and became random across all nodes.
About 2.5 weeks ago I began researching the heck out of my network: VLAN rules, pfSense configuration, switches, NIC settings. To no avail. I knew it had to do with unbound versus CARP because 'ping' and 'nslookup' only worked randomly. I couldn't rule out the rest of my network config yet. My non-VLAN / regular LAN interfaces were working as expected though, just not as stable and fast as before pfSense 2.5.x.
Today my primary node WAN VIP became backup again with no clue whatsoever and the secondary WAN VIP the master. I rebooted the secondary pfSense node and voila the VLAN computer node suddenly worked after 2 to 3 seconds. That got me to this forum post.
Maybe my issue is related to something else, but I suspect not. For the first time in years my confidence in pfSense has dropped considerably. This was a roller coaster up until now.
Hardware:
- 2x Supermicro X10SDV-6C-TLN4F (Xeon D-1521 w/ 32 GB RAM, 256 GB NVMe, Add-On Card dual SFP+ Intel X710-BM2 from FS.com).
Dedicated SYNC interface connected to each other via Cat.7A cable.
I will report back ASAP once I know more after testing, e.g. with ifconfig. Thanks for your hints! Appreciated.
-
-
@rle That would explain the DNS Resolver failures when both nodes were up. Once you set it to CARP mode, it will automatically configure the relevant iface from VIP to CARP VIP on the primary node after the reload. You have to "Force Reload" so it also syncs to the secondary, but there is a bug: it syncs the 0 skew setting over, which it should not.
So before forcing a reload on the secondary node, go to the pfBlocker DNSBL settings there, set the skew to 100, and then force a reload on the secondary (it will set it on the pfBlocker CARP VIP once you run it). Beware: if you don't do that, all sorts of weird sh*t can happen on the primary iface the pfBlocker CARP VIP is associated with (check what happened in my case). If the value does get overwritten in the meantime, simply set it back to 100 in both the pfBlocker settings and the CARP VIP iface skew, and save.
If any related IP has died, you can just open the CARP settings and save again so they are rewritten, or do it manually using ifconfig. I suggest you check the runtime CARP and skew settings on the secondary node with ifconfig anyhow and confirm they are valid (Backup -> 100 advskew). None should ever be in INIT state.
I bashed my head against this for nearly 2 weeks before I realized that the damn skew was being overwritten on every reload, causing the BSD CARP to crap out. It kept me running around in circles for some time.
For the moment, keep in mind that on every reload / config change on the primary, the 0 skew value will be written over to the secondary. Until the pfBlocker source is modified to add +100 to the adv skew value on every sync to another node, it's going to fail you.
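To make the ifconfig check on the secondary less error-prone, here is a minimal Python sketch. It assumes the per-VHID status line looks like `carp: BACKUP vhid 1 advbase 1 advskew 100` (verify against your own output), and the names `parse_carp`, `secondary_problems` and the `SAMPLE` text are hypothetical, made up for illustration.

```python
import re

# Assumed format of the CARP status line in ifconfig output, e.g.
#   carp: BACKUP vhid 1 advbase 1 advskew 100
# Check this against what your own nodes print before relying on it.
CARP_LINE = re.compile(
    r"carp:\s+(?P<state>\w+)\s+vhid\s+(?P<vhid>\d+)\s+"
    r"advbase\s+(?P<advbase>\d+)\s+advskew\s+(?P<advskew>\d+)"
)


def parse_carp(ifconfig_output):
    """Return one dict per CARP VHID found in the ifconfig text."""
    entries = []
    for match in CARP_LINE.finditer(ifconfig_output):
        entries.append({
            "state": match.group("state"),
            "vhid": int(match.group("vhid")),
            "advskew": int(match.group("advskew")),
        })
    return entries


def secondary_problems(entries, expected_skew=100):
    """List problems for a node that is supposed to be the backup."""
    problems = []
    for entry in entries:
        if entry["state"] == "INIT":
            problems.append(f"vhid {entry['vhid']} is in INIT state")
        if entry["advskew"] < expected_skew:
            problems.append(
                f"vhid {entry['vhid']} has advskew {entry['advskew']}, "
                f"expected at least {expected_skew}"
            )
    return problems


# Hypothetical sample text, not taken from a real node.
SAMPLE = (
    "\tcarp: BACKUP vhid 1 advbase 1 advskew 100\n"
    "\tcarp: INIT vhid 2 advbase 1 advskew 0\n"
)

for line in secondary_problems(parse_carp(SAMPLE)):
    print(line)
```

On a real secondary you would paste the ifconfig output in place of `SAMPLE`; an empty problem list means every VHID is out of INIT and carries a skew of at least 100.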
-
I have been banging my head for the last 6 hours. So fed up to be honest.
When I enabled the CARP mode of pfBlockerNG, my complete network went crashing down the rabbit hole. Played again with various settings: especially unbound mode vs unbound python mode, resetting states, ifconfig, you name it....
My conclusion is that pfSense High Availability CARP w/ pfBlockerNG/unbound (and what about IPS/IDS?) is simply not up to par anymore nowadays. This use case is unfit for (business) production use.
I think Netgate has a very interesting dilemma and challenge with pfSense/FreeBSD and in keeping up programming/dependency wise versus up to par features.
Therefore, I'm going back to a single instance of pfSense with a much broader solid battle field tested base in combination with an old fashioned strategy of a good backup with a spare node. Downtime is going to be far less prevalent than what I'm experiencing now with HA CARP.
-
@rle I understand your frustration (been there) and see what you mean. This is why I never update to the latest release on a production environment and rather do this "locally" where I can handle the disruption(s) easily. What you experienced is exactly what I have, verbatim, and I assure you it's caused by the CARP skew issue on the secondary node. If you are doing this in a production environment then avoid it and stick to a single node for now. The workaround is what I described above, but you need to manually intervene after every full sync and keep that skew difference monitored as it will bring your primary iface (thus network) down. It's a major fail yes, but easily fixable. I hope the developer will update the source ASAP.
-
@rle This is a pfBlockerNG bug:
https://redmine.pfsense.org/issues/11964
Use the "IP Alias" VIP type or wait for the fix.
-
@viktor_g That will still cause intermittent DNS failures and there will be 2 identical active IPs on 2 different hosts. Also, the resolver won't listen on the CARP GW IP. See above.
-
@viktor_g @Luke_71 Thanks for the feedback. In hindsight I'm going to wait it out (tired, and tired of problem solving). So I've just shut down the secondary node, pulled the SYNC cable and disabled CARP for the time being. This is IMHO panning out as an acceptable temporary solution until a fix comes along.
Will keep you posted.
-
@luke_71
Please install the System Patches package:
https://docs.netgate.com/pfsense/en/latest/development/system-patches.html
and apply Patch https://github.com/pfsense/FreeBSD-ports/pull/1071/commits/96abc00bba758dddebc09611300ac4680dc0fc5a
Then run pfBlockerNG Force restart
-
@viktor_g Unfortunately I got some error messages.
Status update:
--> Path Strip Count must be set to 4 instead of 2 (duh). Patch applied.
Error Message
-
If I apply the CARP mode to pfBlockerNG I get:
Status / System Logs / System / General
May 28 01:46:44 dhcpleases 63735 Could not deliver signal HUP to process because its pidfile (/var/run/unbound.pid) cannot be read, No such file or directory.
May 28 01:46:44 dhcpleases 63735 Could not deliver signal HUP to process because its pidfile (/var/run/unbound.pid) cannot be read, No such file or directory.
May 28 01:46:29 dhcpleases 63735 Could not deliver signal HUP to process because its pidfile (/var/run/unbound.pid) cannot be read, No such file or directory.
May 28 01:46:29 dhcpleases 63735 Could not deliver signal HUP to process because its pidfile (/var/run/unbound.pid) cannot be read, No such file or directory.
May 28 01:46:27 dhcpleases 63735 Could not deliver signal HUP to process because its pidfile (/var/run/unbound.pid) cannot be read, No such file or directory.
May 28 01:46:20 dhcpleases 36267 Could not deliver signal HUP to process because its pidfile (/var/run/unbound.pid) cannot be read, No such file or directory.
and Unbound + pfb_dnsbl service will not start at all regardless of DNSBL Mode.
Only the DNSBL VIP Type = IP Alias works for me.
In other words: I cannot properly test the patch unfortunately.
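For anyone hitting the same log line, a rough Python sketch of what "pidfile cannot be read" boils down to: the file is missing, its content is not a PID, or the PID no longer belongs to a live process. The function name `pidfile_status` and its return strings are my own invention; only the path comes from the log above.

```python
import os


def pidfile_status(path="/var/run/unbound.pid"):
    """Classify a pidfile as missing, unreadable, stale or running."""
    try:
        with open(path) as handle:
            pid = int(handle.read().strip())
    except FileNotFoundError:
        return "missing pidfile"
    except ValueError:
        # The file exists but does not contain a number.
        return "unreadable pid"

    try:
        # Signal 0 delivers nothing; it only checks that the PID exists.
        os.kill(pid, 0)
    except ProcessLookupError:
        return "stale pidfile"
    except PermissionError:
        # The process exists but belongs to another user.
        return "running"
    return "running"


print(pidfile_status())
```

A "missing pidfile" result matches the dhcpleases message and suggests unbound never started (or died), rather than dhcpleases itself misbehaving.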
-
@viktor_g I confirm the patch works properly, the skew is no longer overwritten with a force reload on both nodes and if the resulting added value (100) is over 254 it reverts to max 254.
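The behavior I observed can be sketched in Python as below. This is only my reading of the result, not the actual pfBlockerNG code; `secondary_advskew` and the two constants are hypothetical names.

```python
MAX_ADVSKEW = 254        # ceiling observed after applying the patch
SECONDARY_OFFSET = 100   # value added when syncing to the secondary


def secondary_advskew(primary_advskew, offset=SECONDARY_OFFSET):
    """Skew for the secondary: primary skew plus offset, capped at the max."""
    return min(primary_advskew + offset, MAX_ADVSKEW)


print(secondary_advskew(0))
print(secondary_advskew(200))
```

With a primary skew of 0 the secondary ends up at 100, and anything that would exceed 254 is clamped to 254.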
One additional observation: per the pfSense HA CARP guide, the CARP VIPs should have the same subnet as the main interface:
https://docs.netgate.com/pfsense/en/latest/troubleshooting/high-availability.html
Incorrect Subnet Mask
The real subnet mask must be used for a CARP VIP, not /32. This must match the subnet mask for the IP address on the interface to which the CARP IP is assigned.
I am certain that /32 is OK for pfBlocker on "local" ifaces. But based on the above, when pfBlocker is configured in CARP VIP mode, shouldn't the subnet mask match that of the assigned iface instead of being /32? Or am I reading this incorrectly?
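To illustrate the rule as I read it from the guide, a small Python sketch using the standard `ipaddress` module. It only compares the masks, which is all the quoted text asks for; the function name and the example addresses are made up.

```python
import ipaddress


def vip_mask_matches(interface_cidr, vip_cidr):
    """True if the CARP VIP uses the same prefix length as the interface."""
    iface = ipaddress.ip_interface(interface_cidr)
    vip = ipaddress.ip_interface(vip_cidr)
    # Comparing masks across IPv4 and IPv6 makes no sense, so require
    # the same address family first.
    if iface.version != vip.version:
        return False
    return iface.network.prefixlen == vip.network.prefixlen


# Hypothetical addresses, for illustration only.
print(vip_mask_matches("192.168.1.2/24", "192.168.1.1/24"))
print(vip_mask_matches("192.168.1.2/24", "192.168.1.1/32"))
```

The first call passes because both sides use /24; the second is the /32 case the guide warns about.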
Thanks for the patch.
-
@rle can you check the logs and see what the issues are with unbound? I only select LAN, localhost and CARP VIP as listening ifaces (don't select ALL) and WAN as the outbound interface (plus LAN CARP VIP for local domain overrides). After changing to CARP mode in DNSBL, remember to run Force Reload > All in pfBlocker Update, first on the primary and then on the secondary.
-
@luke_71 @viktor_g It seems that I had a (huge) misconfiguration with unbound. My knowledge is not up to par...
Apologies for my rant a couple of posts back. Can't seem to change/edit it however.
Only issue now is that pfb_dnsbl/pfBlockerNG DNSBL service is not starting at all. CARP issues are gone.
Huge thanks to both of you for your help and quick release of the patch!
Tested it on pfSense 2.6.0.a.20210527.0100
-
@rle I have no issues with pfBlockerNG but I'm on 2.5.1 / 3.0.0_16 + patch. I can only suggest you check the logs after having run a full reload on both nodes. Be sure that the unbound service is running without issues and that the DNSBL webserver config has no conflicting ports on the LAN interface.