CARP triggered on the BACKUP only without obvious reason

awebster

In this thread https://forum.pfsense.org/index.php?topic=90758.0, Jimp gives some good advice, which I had also stumbled upon by trial and error.

Set ADVBASE higher on the backup.

Infact, there is a clue to this in the DHCP server Failover peer configuration

Leave blank to disable. Enter the interface IP address of the other machine. 
Machines must be using CARP. Interface's advskew determines whether the DHCPd process is Primary or Secondary. 
Ensure one machine's advskew<20 (and the other is >20).

bennyc

hmm, interesting tip. When I increase in the BACKUP for a certain CARP the Base setting to anything higher (like "22"), after a while (I assume a next sync?) it returns to value "1".
And this leads also to the issue that my dhcpd failover states are in error. Until I change it back. And then after a while we go back to "1". (manual loop ::)

So either I'm doing it wrong, or there is something wrong?

awebster

@bennyc:

hmm, interesting tip. When I increase in the BACKUP for a certain CARP the Base setting to anything higher (like "22"), after a while (I assume a next sync?) it returns to value "1".
And this leads also to the issue that my dhcpd failover states are in error. Until I change it back. And then after a while we go back to "1". (manual loop ::)

So either I'm doing it wrong, or there is something wrong?

Sorry, I forgot one critical detail.

You must uncheck Synchronize Vitual IPs in the System -> High Avail. Sync, otherwise the MASTER will keep overwriting the ADVBASE value.
This also means you must manually configure the VIP address on each box, initially check the Synchronize Vitual IPs when you do the setup, then uncheck it to go into production, and never check it again.

I guess this is a bug, because ADVBASE, and SKEW should not be overwritten on sync.

bennyc

First of all; thanks for your hints here!

Ok, that makes sense. In meantime, this is driving me nuts.
I disabled the sync for dhcpd. Then modified each CARP IP on the backup with a Adv. of 22. Then I had one failover group, that for no obvious reason remained in unknown state. Reboot, went back to settings, compaired dhcpd.conf from each member, they seemed ok.

Restarted the dhcpd from the backup, all groups on my Master reported in recover, and on my backup they are all (but one) normal :(
Then restarted the dhcpd on the master, seems all groups on my Master are stuck in recover >:(

sigh… ::)

awebster

You are certain there are no duplicate CARP IDs or HSRP/VRRP sessions using the same CARP IDs?
From shell, use tcpdump -s 0 -n -i <interface>-T carp proto 112
Check each interface individually to be sure.</interface>

bennyc

Went back to my initial state (full sync incl VIPs), removed even the failover ip's from dhcp's.

Removed sync for CARP again. Then I updated the VIP CARP Adv. to 22 on the backup.

Started the packet capture as requested. Set the FailOver IP again on the problematic interface (only here). Immediately the dhcp failover group shows on the backup, but also immediately in state "recover" and peer-state "unknown-state"

I went over my config, I only have 7 CARP IP's, and they each have a unique hvid. Did a capture on both nodes, on the IF which is giving me troubles, and I only see CARPv2-advertise 36 for the configured vhid:
22:50:58.137661 IP 10.100.200.251 > 224.0.0.18: CARPv2-advertise 36: vhid=200 advbase=1 advskew=0 authlen=7 counter=10858211938054085703

Changed the vhid just to be sure, repeated the test (remove failover, add failover), same result.

I must add, it justed dawned to me this IF happens to be the one connected on a Cisco stack with hsrp configured on it. It seems now a good idea for me to review it's config…

bennyc

@awebster

Tnx for all hints here, solved the issue in meantime.
It required a reboot of the Master to get things running again, and a config error from my part for that last dhcp group.

bennyc

@awebster:

I guess this is a bug, because ADVBASE, and SKEW should not be overwritten on sync.

Perhaps silly question, but did you report this? Seem indeed unwanted behavior that could be fixed as a bug…

awebster

Thanks, here it is: https://redmine.pfsense.org/issues/5528

bennyc

Great! (& tnx!) I would have done it otherwise, but credits go to you…