Thanks for the excellent reply. I've retested as you suggested by entering persistent maintenance and there is no packet loss that way (perst maint, reboot, leave persist maint). I am still having a small problem with freeradius xmlrpc sync between the two but I posted that in a separate topic (see https://forum.pfsense.org/index.php?topic=135864.0).
There aren't any messages in Status > Carp. It does fail over as expected when I use Maintenance Mode.
I've run a packet capture on the LAN interface on the secondary node (10.1.1.2), then unplugged the WAN connection on the primary (10.1.1.1). That gives the following log:
13:08:01.211091 IP 10.1.1.1 > 184.108.40.206: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
13:08:02.215013 IP 10.1.1.1 > 220.127.116.11: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
13:08:03.219572 IP 10.1.1.1 > 18.104.22.168: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
13:08:04.235793 IP 10.1.1.1 > 22.214.171.124: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
13:08:05.237226 IP 10.1.1.1 > 126.96.36.199: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
13:08:06.248646 IP 10.1.1.1 > 188.8.131.52: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
13:08:07.251671 IP 10.1.1.1 > 184.108.40.206: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
13:08:07.814455 IP 10.1.1.1 > 220.127.116.11: CARPv2-advertise 36: vhid=11 advbase=1 advskew=240 authlen=7 counter=13725484716165676517
13:08:07.814467 IP 10.1.1.2 > 18.104.22.168: CARPv2-advertise 36: vhid=11 advbase=1 advskew=100 authlen=7 counter=16795567171629106047
13:08:09.237432 IP 10.1.1.2 > 22.214.171.124: CARPv2-advertise 36: vhid=11 advbase=1 advskew=100 authlen=7 counter=16795567171629106047
13:08:09.237741 IP 10.1.1.1 > 126.96.36.199: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
13:08:10.260303 IP 10.1.1.1 > 188.8.131.52: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
13:08:11.276073 IP 10.1.1.1 > 184.108.40.206: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
13:08:12.338015 IP 10.1.1.1 > 220.127.116.11: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
13:08:13.400586 IP 10.1.1.1 > 18.104.22.168: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
13:08:14.463607 IP 10.1.1.1 > 22.214.171.124: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
13:08:15.525578 IP 10.1.1.1 > 126.96.36.199: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
13:08:16.558297 IP 10.1.1.1 > 188.8.131.52: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
13:08:17.563721 IP 10.1.1.1 > 184.108.40.206: CARPv2-advertise 36: vhid=11 advbase=1 advskew=0 authlen=7 counter=13725484716165676517
The first few entries cover the initial state where everything is plugged in as normal, and the primary node is advertising itself with advbase 1 and advskew 0.
At 13:08:07 the primary changes to advbase 1 and advskew 240. The secondary replies with advbase 1 and advskew 100, which I guess means it has taken over the LAN VIP as expected.
At 13:08:09 the primary starts advertising with advbase 1 and advskew 0 again, which seems to coincide with the LAN VIP failing back to the primary while the WAN VIP remains on the secondary.
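For anyone reading along, those skew values map directly to advertisement timing: a CARP node advertises every advbase + advskew/256 seconds, and the node with the shortest interval wins the master election. The jump to 240 when the primary lost its WAN is consistent with the node demoting itself. A quick sketch of the intervals behind the capture (plain shell arithmetic, milliseconds rounded down):

```shell
#!/bin/sh
# CARP advertisement interval = advbase + advskew/256 seconds.
# Skews mirroring the capture: 0 (normal master), 100 (backup),
# 240 (demoted, e.g. after losing the WAN gateway).
for skew in 0 100 240; do
    # work in milliseconds to avoid floating point in sh
    ms=$(( 1000 + skew * 1000 / 256 ))
    echo "advbase=1 advskew=$skew -> interval ${ms}ms"
done
```

So a demoted master at skew 240 still advertises, just more slowly than a backup at skew 100, which is why the secondary takes over.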
Each interface will have a physical interface name, such as em0, ix1, igb0, re4, bce2.
You can get this in Status > Interfaces
Then in Diagnostics > Command Prompt execute ifconfig em0, substituting the correct interface name of your WAN for em0, and post the output. Please do not sanitize more than the first couple of octets of any addresses.
Also please post a quick WAN pcap of the CARP traffic seen on both nodes. Please set the level of detail to Full.
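A capture filtered to CARP only is usually enough; something like the tcpdump line below on each node (em0 is a placeholder for your WAN interface). The per-source tally is demonstrated here on a few canned sample lines, not real output:

```shell
#!/bin/sh
# On each node (substitute your WAN interface for em0):
#   tcpdump -ni em0 'ip proto 112'    # protocol 112 = CARP/VRRP
# Tally advertisements per source address, shown on canned sample lines:
awk '/CARP/ { count[$3]++ } END { for (ip in count) print ip, count[ip] }' <<'EOF'
13:08:01 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advskew=0
13:08:02 IP 10.1.1.1 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advskew=0
13:08:03 IP 10.1.1.2 > 224.0.0.18: CARPv2-advertise 36: vhid=11 advskew=100
EOF
```

If only the local node's source address ever shows up on each side, the advertisements are not crossing the switch, and that is the symptom to chase.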
Yeah, that's what it is supposed to do.
I would set a maintenance window, put the primary in maintenance mode, do what you have to do, and remove it from maintenance mode.
And I'd stop moving cables around.
Hi, I wrote that code some years ago, freely publishing a pfsense customisation I had made for a service provider who had hired me some time before.
As you repeat here, I was surprised to see that such a feature (i.e. DNS update on a CARP failover) required an ad-hoc script, so when I read that another user was looking for the same thing, I made it available with some remarks, knowing it could be useful later.
Still, since many changes have been introduced in subsequent pfSense releases, to make that code work again you will have to (and will always have to, because a custom patch requires continuous checking/maintenance at every pfSense update, unless it becomes a standard feature as you hope):
- ensure PHP is still the current scripting language for pfSense
- verify the current release's PHP syntax for the string-manipulation functions used (I had already slightly modified it for a later pfSense release)
- verify the current config.xml structure for setting the configuration keys that enable/disable dynamic DNS entries (check the similar code used for the GUI)
- verify the current rc.carpmaster/rc.carpbackup (see parameters and structure)
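As a purely hypothetical illustration of the last two points, the hook ends up doing something in this spirit. Every path, XML key, and layout below is an assumption to be checked against the config.xml and rc.carpmaster of your release before use:

```shell
#!/bin/sh
# Hypothetical sketch only: on a CARP master transition, flip the <enable>
# flag of dyndns entries in config.xml. The XML layout, key names, and path
# are assumptions -- verify them against your pfSense release (and the
# similar code used by the GUI) before relying on any of this.
CONFIG=${1:-/tmp/config.xml}

# Stand-in fragment for the real config.xml:
cat > "$CONFIG" <<'EOF'
<dyndnses>
  <dyndns>
    <host>fw.example.com</host>
    <enable>false</enable>
  </dyndns>
</dyndnses>
EOF

# Became master: enable every dynamic DNS entry.
sed -i.bak 's|<enable>false</enable>|<enable>true</enable>|' "$CONFIG"
grep '<enable>' "$CONFIG"
```

The real patch works in PHP via the pfSense config functions rather than sed, but the shape of the job (transition event in, config keys toggled, dyndns update triggered) is the same.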
I don't have time to commit to this now, but I suggest you just keep at it with some tests (possibly displaying intermediate string-manipulation results) until you get the desired behaviour.
Let me say that even though you called it just a "kludge", I have always been proud of that smart and quick snippet of code, tailored to solve a specific issue.
Since creating and maintaining it is your effort (I really doubt it can be promoted to a feature, being so specific), it will be up to you to decide whether to publish it or keep it for yourself.
Again, more details needed. See above.
"Can't ping out" is a symptom. You need to diagnose to find out what is not in place that is put back when you save the interface. My guess is something like a default gateway. But that's just a guess.
Got it. I wasn't really thinking about it. Thinking about it, you're right. It makes no sense for me to have obfuscated them.
EDIT: Deobfuscated them through all posts.
EDIT 2: So I'm not convinced I've got my problem solved just yet, but it's possible. I reset my pfSense slave to factory defaults and have been reconfiguring it from the ground up. So far DNS is still working, but I still have a handful of interfaces to configure. At this stage, I would expect it not to be working on any interface if it was going to have issues, so I'm hopeful. If this does fix it, I have absolutely no idea what was broken.
@citronvolcano said in Sync captive portal logged in state:
is there a way to sync captive portal logged users between the Master and the Backup?
Not that I know of. Last time I ran an HA captive portal I am pretty sure I told it not to sync the CP settings and just disabled the captive portal on the secondary. In the event of a failover it was better to just allow the traffic than to break 3000 CP sessions all at once.
Yes, there would be a "vulnerability" in that a savvy user could just manually set their gateway to the secondary's interface address and bypass the portal but that was deemed a lesser concern. The access was "free" anyway. The primary reason HA was implemented was keeping the front desk from getting slammed in the event of a failure, which equates to keeping the guests happy.
Each ISP modem is connected to an unmanaged Layer 2 switch; one port of that switch connects to one FW and another port connects to the other FW.
Different switches per WAN, correct?
Each box is identical, except one is Master and the other Backup of course so I know my HA sync is working.
The SYNC interface has nothing to do with the CARP VIP status on each interface or which node is master or backup at any given time.
My problem here is that when I have one ISP connected, the IP address assigned to the VIP never shows up in the modem's ARP table.
The CARP MAC only shows up in the upstream MAC address table due to the CARP advertisements.
When the node holding the CARP MASTER status sees an ARP request for the CARP VIP, it answers with an ARP response. This ARP response is sourced from the interface MAC address but contains the CARP MAC address as the is-at MAC address.
There is no reason for the modem to contain the CARP VIP in its ARP table unless it needs to route traffic from itself to the CARP address.
That said, MANY ISP devices simply do not do what is necessary for CARP to function correctly. They might only allow one MAC per port or any of a number of silly things.
Some work fine.
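When checking the modem's MAC table rather than its ARP table, it helps to know exactly which address to look for: the CARP virtual MAC follows the VRRP-style convention 00:00:5e:00:01:&lt;vhid in hex&gt;, so it can be computed directly (vhid=1 below is a placeholder; substitute the vhid of your WAN VIP):

```shell
#!/bin/sh
# CARP virtual MAC = 00:00:5e:00:01:<vhid in hex>.
# vhid=1 is a placeholder -- substitute your own WAN VIP's vhid.
vhid=1
printf '00:00:5e:00:01:%02x\n' "$vhid"
```

If that MAC never appears in the upstream device's table, the CARP advertisements are not being learned, which matches the single-MAC-per-port behaviour some ISP gear exhibits.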
Then maybe it is just multicast connectivity.
With both as MASTER you should be able to see the CARP heartbeats from the other node when you capture CARP on VLAN10 or VLAN20. If you only see the heartbeats from the local node you are capturing on, there's your symptom.
CARP VLANs work fine.
Are the CARP VIPs MASTER and BACKUP on the primary and secondary respectively (Status > CARP)?
Did you instruct your DHCP server to give the CARP VIP as the default gateway in its leases?
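In ISC dhcpd terms (all addresses below are hypothetical examples), the relevant bit is handing out the VIP as the router, not either node's own interface address:

```
subnet 10.1.1.0 netmask 255.255.255.0 {
  range 10.1.1.100 10.1.1.199;
  option routers 10.1.1.3;  # the CARP VIP, not 10.1.1.1 or 10.1.1.2
}
```

In the pfSense GUI this is the Gateway field under Services > DHCP Server for that interface; leaving it blank hands out the interface address, which defeats the failover.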
but it does not work as well
What does "does not work" mean?
ARP responses from the firewalls are always "CARP VIP is-at CARP MAC". But those responses are sourced from the interface MAC address, not the CARP MAC. The CARP MAC address is included in the ARP is-at response, not the frame itself.
What steers the traffic to the proper node that holds the CARP MASTER is the fact that the CARP advertisements are sourced from the CARP MAC address. This tells the switching layer what port to send the traffic to. No traffic ever gets sourced from the CARP MAC at layer 2 other than the CARP advertisements.
This is why most CARP problems come down to switching, not pfSense itself.
@derelict It is amazing; now I can finally shut down my Dell R210 II, upgrade the memory, and remove that 12 TB HDD without downtime, which I had been planning to do for a very long time.
Hmmm, interesting. I based my question on when we had these in VMs and had a VLAN set up for the sync interfaces. Sounds like there was some sort of a problem back then if the list was significantly longer (I want to say a couple dozen). Or my memory is significantly bad. :)
In hindsight, using VMs saved us money in startup costs and was cool to do, but I wish we'd gotten the SG-4860s up front...fewer hassles over time.
Why would you want to run multiple layer 3 networks on the same layer 2? It's a borked config right out of the gate - are you in the middle of migrating from that huge /16 that makes zero sense to the more reasonable /24?
I found a workaround by rewriting the client's dhclient.conf file, but this is not satisfying.
I guess we will have to externalize our DHCP service from pfSense, probably to a dedicated ISC dhcpd server capable of understanding that an FQDN shouldn't have a duplicate domain name appended... :-(
@vigorfac said in High Avail. Sync broken:
Nov 7 12:40:18 php-fpm 51646 /status_logs_settings.php: The command '/usr/local/sbin/unbound -c /var/unbound/unbound.conf' returned exit code '1', the output was ' unbound[90624:0] error: bind: address already in use  unbound[90624:0] fatal error: could not open ports'
The above error sounds similar to this bug in pfSense, which was since resolved:
https://redmine.pfsense.org/issues/7326#note-2 (the code didn't wait long enough for unbound to stop before trying to start it again...in our case the master server was unaffected but the backup router would end up with unbound not running)
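The fix in that ticket amounts to waiting for the old daemon to actually exit (and release its port) before starting a new one. A minimal sketch of that pattern, with the pidfile path being an assumption:

```shell
#!/bin/sh
# Sketch of the wait-before-restart pattern: poll until the old unbound
# has exited and released its port, instead of immediately failing with
# "address already in use". The pidfile path is an assumption.
wait_for_stop() {
    pidfile=$1
    tries=0
    while [ -f "$pidfile" ] && kill -0 "$(cat "$pidfile")" 2>/dev/null; do
        tries=$((tries + 1))
        [ "$tries" -gt 30 ] && return 1   # give up after ~3 seconds
        sleep 0.1
    done
    return 0
}

# e.g.: wait_for_stop /var/run/unbound.pid && unbound -c /var/unbound/unbound.conf
```

On the affected backup router, the symptom was exactly the new instance racing the dying one, which is why only restarting unbound by hand got it going again.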
re: HA sync, we have "DNS Forwarder and DNS Resolver configurations" checked in our setup and have no sync issues. So I don't think that by itself is an issue.
Since I opened my mouth I felt obligated to test this tonight. I entered persistent maintenance mode a couple times and did not see issues switching back. So I suppose it might be related to our prior setup.
It didn't happen every time, but I'd say a majority of the time. Then again I seem to recall it happening occasionally just entering and leaving persistent maintenance mode so I don't think it's related to the upgrading process.
The VIPs are lower case and have no leading zero, however the LAN IP is "2607:xxxx:0:4c::1/64 (vhid: 154)" with a lone zero in there. Note it was the WAN IP that got stuck in dual Master (2607:xxxx::12/125 (vhid: 153)).
The reason for the NAT is because its part of a DNS failover.
I got it working like this:
WAN1 IP: 220.127.116.11 NAT'ed to 18.104.22.168
WAN2 IP: 22.214.171.124 NAT'ed to 126.96.36.199
That way i got a WAN failover to the same server.
Yes, I had the interfaces restricted - I did not want ntpd to listen on the WAN interface.
Resetting states did not help; same issue.
But attaching ntpd to the WAN interface did the trick.
Now I have hybrid NAT and the proper ntpd source IP.
I finally found the solution, yay!
On the Cisco ME3400E the default port type is UNI and it has to be set to NNI.
From official Cisco config guide:
Traffic is not switched between these ports, and all arriving traffic at UNIs or ENIs must leave on NNIs to prevent a user from gaining access to another user's private network.
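For completeness, the change is a one-liner per uplink port (interface name below is hypothetical):

```
interface GigabitEthernet0/1
 port-type nni
```

With both firewall-facing ports as NNI, the switch will forward traffic (including the CARP advertisements) between them, which UNI ports deliberately block.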