pfBlocker CARP node goes into a kind of backup state after pfBlocker update. pfb_dnsbl stops.
-
Thanks all.
I have 4 pairs of CARP boxes. One pair is using an IP Alias which I am going to change to CARP.
The other three use the CARP function in pfBlocker, appropriately configured with primary base 1, skew 0 and secondary base 1, skew 100, allowing the /32, which is always restored after an update. That works, with one very irritating problem, which I discuss below in text from another post I made.
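For anyone wondering what base and skew actually do: per the FreeBSD CARP documentation, a node advertises every advbase + advskew/256 seconds, and the node advertising most often wins MASTER. A minimal sketch of that arithmetic, using the values above (the function name is just for illustration):

```python
def carp_advert_interval(advbase, advskew):
    """Seconds between CARP advertisements: advbase + advskew/256."""
    # Ranges documented for FreeBSD ifconfig: advbase 1..255, advskew 0..254.
    if not 1 <= advbase <= 255:
        raise ValueError("advbase must be 1..255")
    if not 0 <= advskew <= 254:
        raise ValueError("advskew must be 0..254")
    return advbase + advskew / 256


primary = carp_advert_interval(1, 0)      # base 1, skew 0
secondary = carp_advert_interval(1, 100)  # base 1, skew 100

print("primary advertises every", primary, "s")
print("secondary advertises every", secondary, "s")
print("primary should hold MASTER:", primary < secondary)
```

With these settings the primary advertises slightly faster than the secondary, which fits the "preempting a slower master" wording in the kernel log lines.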
Today I found a new wrinkle. If I change the update time of the pfBlocker instance, then after the save the pfBlocker CARP node also behaves as described below. This is totally predictable.
One comment. For me Unbound works fine, great even, but when the pfBlocker CARP node fails and the corresponding pfb_dnsbl process stops, DNS from that pfSense box cannot reach the users, and the secondary box takes over.
Yes, if I disable pfBlocker, all works fine.
As I mention below, editing and saving the VIPs from the Firewall menu fixes the pfBlocker CARP nodes until the next pfBlocker update. I am wondering if there is a process I could invoke from cron to reset the CARP in the same manner?
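The detection half of such a cron job could look something like the sketch below. It only parses `ifconfig` output for the `carp:` status line and reports any expected VHID that is not MASTER or BACKUP; it does not perform the reset itself, since the right call for that is the open question. Assumptions: a Python 3 interpreter is available on the box, the status line has the form FreeBSD prints (`carp: MASTER vhid 4 advbase 1 advskew 0`), and the sample text, its address, and the function names are made up for illustration.

```python
import re

# Matches FreeBSD ifconfig status lines such as:
#   carp: MASTER vhid 4 advbase 1 advskew 0
CARP_LINE = re.compile(
    r"carp:\s+(\S+)\s+vhid\s+(\d+)\s+advbase\s+(\d+)\s+advskew\s+(\d+)"
)


def carp_states(ifconfig_text):
    """Return {vhid: state} for every carp line found in the text."""
    return {int(m.group(2)): m.group(1) for m in CARP_LINE.finditer(ifconfig_text)}


def unhealthy_vhids(ifconfig_text, expected_vhids):
    """Expected VHIDs that are missing or not in MASTER/BACKUP."""
    states = carp_states(ifconfig_text)
    return sorted(
        v for v in expected_vhids if states.get(v) not in ("MASTER", "BACKUP")
    )


# Illustrative sample only, not captured from a real system.
SAMPLE = (
    "igb1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500\n"
    "\tinet 10.1.10.1 netmask 0xffffff00 broadcast 10.1.10.255\n"
    "\tcarp: MASTER vhid 4 advbase 1 advskew 0\n"
)

print(carp_states(SAMPLE))
print(unhealthy_vhids(SAMPLE, [4, 5]))
```

In a real job the text would come from running `ifconfig` on the interface, for example through `subprocess.run`, and a non-empty result would trigger whatever recovery step turns out to work.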
After a pfBlockerNG update, the pfBlocker CARP VIP on the primary does not show MASTER or BACKUP; it is just blank, and pfb_dnsbl stops. Then the secondary pfBlocker CARP VIP takes over.
See screen shots below.
So if I act as if I am editing the primary pfBlocker CARP VIP from the Firewall menu and just save it, the primary pfBlocker CARP VIP becomes MASTER and the secondary one becomes BACKUP.
Then I can start pfb_dnsbl successfully.
Here is something from the general logs.
Jul 26 09:16:52 php 40257 [pfBlockerNG] DNSBL parser daemon started
Jul 26 09:16:52 lighttpd_pfb 39134 [pfBlockerNG] DNSBL Webserver started
Jul 26 09:16:52 lighttpd_pfb 36785 [pfBlockerNG] DNSBL Webserver stopped
Jul 26 09:16:41 kernel carp: 4@igb1: BACKUP -> MASTER (preempting a slower master)
Jul 26 09:16:40 check_reload_status 441 Reloading filter
Jul 26 09:16:40 kernel carp: 4@igb1: INIT -> BACKUP (initialization complete)
Jul 26 09:16:39 php-fpm 97364 /rc.filter_synchronize: XMLRPC reload data success with https://10.1.10.2:443/xmlrpc.php (pfsense.restore_config_section).
Jul 26 09:16:38 php-fpm 97364 /rc.filter_synchronize: Beginning XMLRPC sync data to https://10.1.10.2:443/xmlrpc.php.
Jul 26 09:16:38 php-fpm 97364 /rc.filter_synchronize: XMLRPC versioncheck: 23.3 -- 23.3
Jul 26 09:16:38 php-fpm 97364 /rc.filter_synchronize: XMLRPC reload data success with https://10.1.10.2:443/xmlrpc.php (pfsense.host_firmware_version).
The CARP system seems to be fine otherwise. Once I have done the manual intervention, things are fine until pfBlocker does its updates again.
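To compare what the kernel does on different boxes, the `kernel carp:` lines can be pulled out of a log dump with a small parser. This is a minimal sketch that assumes the line format shown in the log above; the function name is hypothetical, and the sample is three lines copied from that log (newest first, as pfSense lists them).

```python
import re

# Matches lines such as:
#   Jul 26 09:16:41 kernel carp: 4@igb1: BACKUP -> MASTER (preempting a slower master)
TRANSITION = re.compile(r"carp: (\d+)@(\S+): (\w+) -> (\w+) \((.*)\)")


def carp_transitions(log_text):
    """Return (vhid, interface, old_state, new_state, reason) per carp line."""
    found = []
    for line in log_text.splitlines():
        m = TRANSITION.search(line)
        if m:
            vhid, iface, old, new, reason = m.groups()
            found.append((int(vhid), iface, old, new, reason))
    return found


LOG = (
    "Jul 26 09:16:41 kernel carp: 4@igb1: BACKUP -> MASTER (preempting a slower master)\n"
    "Jul 26 09:16:40 check_reload_status 441 Reloading filter\n"
    "Jul 26 09:16:40 kernel carp: 4@igb1: INIT -> BACKUP (initialization complete)\n"
)

for vhid, iface, old, new, reason in carp_transitions(LOG):
    print(f"vhid {vhid} on {iface}: {old} -> {new} ({reason})")
```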
Observations? Suggestions? What am I missing?
Thanks
The other machine's CARP pair:
Jul 26 08:18:16 php-fpm 31382 /rc.carpmaster: HA cluster member "(10.33.10.1@em1): (GREENLAN)" has resumed CARP state "MASTER" for vhid 5
Jul 26 08:18:15 check_reload_status 457 Carp master event
Jul 26 08:18:15 kernel carp: 5@em1: BACKUP -> MASTER (preempting a slower master)
Jul 26 08:18:15 php-fpm 19625 /rc.carpbackup: HA cluster member "(10.33.10.1@em1): (GREENLAN)" has resumed CARP state "BACKUP" for vhid 5
Jul 26 08:18:15 php-fpm 19625 /rc.filter_synchronize: XMLRPC reload data success with https://172.16.1.3:443/xmlrpc.php (pfsense.restore_config_section).
Jul 26 08:18:14 check_reload_status 457 Reloading filter
Jul 26 08:18:14 kernel carp: 5@em1: INIT -> BACKUP (initialization complete)
Jul 26 08:18:14 check_reload_status 457 Carp backup event
Jul 26 08:18:12 php-fpm 19625 /rc.filter_synchronize: Beginning XMLRPC sync data to https://172.16.1.3:443/xmlrpc.php.
Jul 26 08:18:12 php-fpm 19625 /rc.filter_synchronize: XMLRPC versioncheck: 23.3 -- 23.3
Jul 26 08:18:12 php-fpm 19625 /rc.filter_synchronize: XMLRPC reload data success with https://172.16.1.3:443/xmlrpc.php (pfsense.host_firmware_version).
Jul 26 08:18:12 php-fpm 19625 /rc.filter_synchronize: Beginning XMLRPC sync data to https://172.16.1.3:443/xmlrpc.php.
Jul 26 08:18:11 php-fpm 70712 /firewall_virtual_ip_edit.php: Beginning configuration backup to https://acb.netgate.com/save
Dashboard of Primary
Primary pfBlocker CARP Config ...
Primary CARP VIP
Secondary pfBlocker CARP Config ...
Secondary CARP VIP
Update settings on Primary node.
-
I think I may be close to the solution. I had two extended power failures on two of my sites. When the power returned, both of those sites stabilized.
Gee, that's just weird, right? Yes, I had tried rebooting the pfSense systems, without success.
I am going to try resetting the switches directly downstream from the CARP pairs that are still giving issues. If that fixes that problem I will write the whole thing up.
It might have to do with Filter Host IDs and the way I originally configured my systems.
-
@reberhar Ok Success! Everything seems happy. Phew!
So how did I manage to goof up such a great system and what was it that I did?
So there is a very long story which I will not tell you.
Suffice it to say that my filter IDs were identical on 3 of the HA pairs. I corrected them and cleared the state tables. The fact that they were identical is part of the story I am not telling you.
I also set up pfBlocker with CARP and made sure the CARP VHIDs were unique. I set up base and skew numbers appropriately on the primary and secondary machines. Nuts to the "do not edit" message.
This was all good and ran perfectly until pfBlocker did its nightly updates. Then I would get the bizarre behavior mentioned above. I noticed that it happened as well when I fussed with OpenVPN.
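On why unique VHIDs matter: a CARP VIP for IPv4 answers on the virtual MAC 00:00:5e:00:01:XX, where XX is the VHID in hex, so two pairs sharing a VHID on the same layer 2 segment end up sharing a MAC. A minimal sketch of a uniqueness check follows; the pair names, segment names, and VIP list are hypothetical, made up only to show the idea.

```python
from collections import defaultdict


def carp_virtual_mac(vhid):
    """Virtual MAC used by an IPv4 CARP VIP with this VHID."""
    if not 1 <= vhid <= 255:
        raise ValueError("VHID must be 1..255")
    return "00:00:5e:00:01:%02x" % vhid


def vhid_collisions(vips):
    """vips: iterable of (pair_name, segment, vhid).

    Returns {(segment, vhid): [pair names]} for every VHID that more
    than one HA pair uses on the same layer 2 segment.
    """
    seen = defaultdict(set)
    for pair, segment, vhid in vips:
        seen[(segment, vhid)].add(pair)
    return {key: sorted(pairs) for key, pairs in seen.items() if len(pairs) > 1}


# Hypothetical inventory, not taken from any real configuration.
VIPS = [
    ("pair-a", "lan-1", 4),
    ("pair-b", "lan-1", 4),  # same VHID on the same segment: collision
    ("pair-c", "lan-2", 4),  # same VHID, different segment: fine
    ("pair-a", "lan-1", 5),
]

print(carp_virtual_mac(4))
print(vhid_collisions(VIPS))
```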
Ok, there is a very nice discussion of the importance of having your switches right in the HA section of the Netgate docs. It has some important, challenging information in it.
So yes, I reset my switches downstream of the HA/CARP pairs and three of the four pairs were happy, but not the fourth. The switch there is a 24-port Ubiquiti smart switch. So I moved a little managed Netgear GS108PE between the pfSense and the Ubiquiti. INSTANT HAPPINESS! The switch was already there. I just moved it up in position.
(Actually a helpful extended power failure reset two of the four switches.)
Ok. I have made enough mistakes on these things. If anybody thinks I goofed here, gee please tell me.
At any rate, DNS now works on failover and all the bizarre behavior that occasioned this post has gone away. I also get to sleep at night.
-
I want to mention that enabling Checksum Offloading in System/Advanced/Networking seems to help with my problem. I have Intel NICs.
CARP seems to be a very CPU-critical operation. It is, by design, very time sensitive. Removing load from the CPU, especially where the NICs are concerned, can't hurt and may help. It does seem to make a difference in my case. A very slight lag can upset the CARP system.
Thus having the switches on board matters as well.
The more I fuss with this, the more I realize how complex HA and CARP really are. It is, however, worth the struggle. It is really quite amazing when it works well.
-
@reberhar What is your ISP bandwidth? Guessing rather high…
It is cool to upgrade pfSense while they are in use. (Backup first, for lurkers)
-
I have 5 sites, but the one I am talking about now has two ISP gateways, one at 350/350 and the other at 150/150.
Fiber optic, really awesome. Before it was DSL which was a very small fraction of the FO bandwidth.
Yes, it is very nice to update one box and then the other.
"Lurkers" ... very appropriate term. I don't like surprises either, especially when I am remote.
-
@reberhar SUCCESS
After the latest upgrade for pfBlocker I started to have the same problems all over again and none of my other methods fixed it.
I finally got onsite and have learned some useful things.
First, I have 2 Netgear GS108PEs, and one worked properly in this situation and the other did not. After thinking about it I realized that the one that functioned had 802.1Q VLAN enabled. So I enabled 802.1Q VLAN on the one that was not functioning correctly and the problem disappeared. No, I didn't make any VLANs on the second unit, although the first unit I mentioned does have them. I just enabled 802.1Q VLAN.
I reasoned that perhaps multicast was somehow involved in this. (duh) So I worked through enabling multicast on my Ubiquiti 24 port smart switch that had failed with this challenge earlier. It actually involved the Cloud Key as well.
This I did just on the two ports I am using for HA, not the entire switch.
That worked too and is still working.
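For checking a switch's multicast or IGMP snooping tables, it helps to know what to look for. CARP advertisements go to the IPv4 multicast group 224.0.0.18 (IP protocol 112), and an IPv4 group maps to an Ethernet MAC of 01:00:5e plus the low 23 bits of the address. A minimal sketch of that mapping; the function name is just for illustration, and how each switch displays the entry will vary.

```python
import ipaddress

CARP_GROUP = "224.0.0.18"  # multicast group CARP advertises to


def multicast_mac(group):
    """Ethernet MAC for an IPv4 multicast group: 01:00:5e + low 23 bits."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError("%s is not an IPv4 multicast address" % group)
    low23 = int(addr) & 0x7FFFFF
    return "01:00:5e:%02x:%02x:%02x" % (
        (low23 >> 16) & 0xFF,
        (low23 >> 8) & 0xFF,
        low23 & 0xFF,
    )


print(CARP_GROUP, "->", multicast_mac(CARP_GROUP))
```

If frames to that MAC are filtered or not flooded on the two HA ports, the nodes cannot hear each other's advertisements.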
Yes, I know, multicast is mentioned in the HA diagnostics write-up. I guess I was just not following through. Actually, I was just a little unsure how to proceed. I have other very smart switches that have been testy in this pfBlockerNG / HA environment. I am excited to try this approach with them.