How should it run?

Juve

Hi all,

I've set up a cluster between two pfSense 1.0.1 boxes and I have a (stupid) question: how should it behave normally? When the master fails, the slave takes over as master, fine. But when the master comes back up after the failure, should it become master again, like it was at the beginning?

I ask because that is not what happens in my setup. Everything seems all right (different VHIDs, correct passwords, good link). When I shut down the master, the slave takes over as it should and everything works well. But when the master comes back up, the slave doesn't give up the CARP IPs, even after 30 minutes (everything still works, of course, with no traffic problems). In fact, I have to click "disable carp" on the slave node to make the master regain its master status. Is this normal?

Thx for your answers.

hoba

Did you set up the cluster nodes manually, or did you use the config sync to set up the backup system? If you set it up manually, did you choose different advertising frequencies? The master system should have a lower advertising frequency than the slave (like 0 on the master and 100 on the backup). This way the system with the lower advertising frequency will become master if available. However, the config sync raises that value automatically when syncing over the VIPs.
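To illustrate why the lower value wins, here is a minimal Python sketch of the comparison. It is a simplified model, not pfSense's or the kernel's actual code. It assumes the advertisement interval is advbase + advskew/256 seconds, with an advbase of 1, and that preemption is enabled. The function names are made up for the example.

```python
def advertisement_interval(advbase: int, advskew: int) -> float:
    """Seconds between CARP advertisements (assumed: advbase + advskew/256)."""
    return advbase + advskew / 256.0


def backup_should_preempt(own_skew: int, master_skew: int,
                          advbase: int = 1, preempt: bool = True) -> bool:
    """Simplified rule: a backup takes over only if preemption is on and it
    advertises strictly more often than the current master."""
    own = advertisement_interval(advbase, own_skew)
    master = advertisement_interval(advbase, master_skew)
    return preempt and own < master


if __name__ == "__main__":
    # Returning node at skew 0, current master at skew 100: it takes over.
    print(backup_should_preempt(own_skew=0, master_skew=100))    # True
    # Both nodes at the same skew: nobody is preferred, current master stays.
    print(backup_should_preempt(own_skew=200, master_skew=200))  # False
```

Under this model, two nodes advertising with the same skew never trigger a fail-back, because neither one advertises more often than the other.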

Juve

I used config_sync, which set the advertising frequency to 100 on the slave. Because it wasn't working as expected, I disabled config_sync for the VIP entries (unchecked the VIPs checkbox on the CARP Settings page) and configured the backup host manually, changing the advertising frequencies to different values (WAN: 133, LAN: 115, DMZ: 186). On the master, both are in master mode, with the advertising frequency at 0.

hoba

You might have misunderstood something. All CARP VIPs on the master should be at 0 and all VIPs on the backup at 100, for example. Don't use different advertising frequencies for CARP VIPs on the same machine.
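That rule can be written down as a small check. The following Python sketch is only an illustration: the function name and the dictionary layout are invented for the example, and the values are the ones mentioned in this thread.

```python
def check_skews(master: dict[str, int], backup: dict[str, int]) -> list[str]:
    """Return a list of problems with a master/backup advskew layout."""
    problems = []
    # Rule 1: all VIPs on the same machine should share one skew value.
    for role, skews in (("master", master), ("backup", backup)):
        distinct = sorted(set(skews.values()))
        if len(distinct) > 1:
            problems.append(f"{role}: VIPs use different skews {distinct}")
    # Rule 2: each backup VIP should have a higher skew than the master's.
    for name, master_skew in master.items():
        backup_skew = backup.get(name)
        if backup_skew is not None and backup_skew <= master_skew:
            problems.append(f"{name}: backup skew {backup_skew} is not "
                            f"higher than master skew {master_skew}")
    return problems


if __name__ == "__main__":
    master = {"WAN": 0, "LAN": 0, "DMZ": 0}
    # Manual values from the post above: flagged as mixed skews.
    print(check_skews(master, {"WAN": 133, "LAN": 115, "DMZ": 186}))
    # Suggested layout: no problems reported.
    print(check_skews(master, {"WAN": 100, "LAN": 100, "DMZ": 100}))
```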

Juve

OK, so I did what you said: I went back to the CARP settings and re-enabled config_sync for virtual IPs.
Clicking save forced a sync.

When I clicked save, node 2 took over from node 1, and 25 minutes later it is still the same. Here is a capture showing both firewalls with an advskew of 200:

[Image: carp.png]

Juve

No ideas?

Do you think I should reinstall and then restore my current configuration?

Thx.

hoba

After activating CARP, the advertising frequency is set to 200 for a bit before it goes to the originally set value. This had to be done to work around a bug in CARP. However, your machines don't seem to do this, so something seems to stop before it gets that far, and I don't have a clue what prevents it from finishing completely. :(

Juve

I noticed that /tmp/carp.sh contains advskews of 200.
If everything were running fine, would this file be rewritten with the advskew at the configured value after those few seconds (or minutes) at 200?

sullrich

Yes.
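A quick way to watch for that change is to pull the advskew values out of the script text. The Python sketch below is a rough illustration only. The thread does not show the contents of /tmp/carp.sh, so the sample lines are hypothetical and assume ifconfig-style commands of the form `carpN ... advskew N`. The function name is also made up.

```python
import re

# Assumed line shape: "... carp0 ... advskew 200 ..." (hypothetical format).
ADVSKEW_RE = re.compile(r"\b(carp\d+)\b.*?\badvskew\s+(\d+)")


def advskews(script_text: str) -> dict[str, int]:
    """Map each carp interface named in the text to its advskew value."""
    found = {}
    for line in script_text.splitlines():
        match = ADVSKEW_RE.search(line)
        if match:
            found[match.group(1)] = int(match.group(2))
    return found


if __name__ == "__main__":
    # Hypothetical sample, not a copy of a real /tmp/carp.sh.
    sample = (
        "/sbin/ifconfig carp0 vhid 1 advskew 200\n"
        "/sbin/ifconfig carp1 vhid 2 advskew 200\n"
    )
    print(advskews(sample))  # {'carp0': 200, 'carp1': 200}
```

Running it against the real file before and after the 120-second wait would show whether the values ever leave 200.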

Juve

Do you have an idea of how I could determine what is killing this process? What should I monitor at startup? The name of a script or something like that.

Many thx for your help guys

sullrich

There should be a php process running with rc.bootup. It should wait 120 seconds before it exits.

Juve

OK, I monitored it, and after the 120 seconds it stopped as it should (I saw no errors or exceptions), but my CARP IPs were still at 200 on each box. The (weird) fact is that they really are able to communicate, since they take over from each other…

The only thing that is a bit "strange" on those boxes is that, on the WAN interface, I NAT packets sourced from the WAN interface to a public IP (mostly NTP requests and pfSense package retrieval and downloads) because my WAN lies in a private network. The public network is handled by the DMZ, and my ISP router is configured to forward that public range through pfSense. Do you think that NAT rule could break the way it should work?

sullrich

And the master's ADVSKEW is set to what again?

Juve

Master: 0
Slave: 100

configuration sync enabled between firewalls

[Image: master.GIF]

[Image: slave.GIF]

[Image: settings.GIF]

sullrich

Hrm. Wish that I could reproduce this…

jakehathaway

I am seeing the same issue. I have 3 CARP addresses (lan, wan, qmoe) plus the pfSense internal address. All items seem to sync just fine. I created the backup from the master. I have all 3 master CARPs at 0 and all 3 backup CARPs at 100. Sometimes only 1 or 2 of the CARPs fail over, and it just holds them. I have to re-save the CARP settings on the backup or reboot the backup pf box to get it to fail back.
I would love to send any configs or debug logs if I can do something to help you see the issue. Please let me know.
Currently my boxes are not in a production environment so now is the prime time to debug.
Thanks.

Juve

Have you checked that your switches are not blocking CARP traffic?
Just to be sure…

sullrich

Switches are constantly an issue with CARP, it seems. Definitely ensure that it's not being blocked/stopped at the switch level.

jakehathaway

I don't have my pfsync interfaces plugged into a switch; they are connected to each other with a crossover cable.

sullrich

CARP != pfSync. CARP traffic will still be present on all interfaces that have a CARP address assigned. If they cannot communicate then it will not work.
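To make the distinction concrete, here is a small Python sketch that tells the two kinds of traffic apart from a raw IPv4 header. It relies on CARP advertisements using IP protocol 112 sent to the multicast address 224.0.0.18, while pfsync uses IP protocol 249. The helper names are invented for the example, and the packets are built by hand rather than captured from a real interface.

```python
import socket
import struct

CARP_PROTOCOL = 112            # IP protocol number used by CARP
PFSYNC_PROTOCOL = 249          # IP protocol number used by pfsync
CARP_MULTICAST = "224.0.0.18"  # destination of IPv4 CARP advertisements


def is_carp_advertisement(ipv4_packet: bytes) -> bool:
    """True if the IPv4 header looks like a CARP advertisement."""
    if len(ipv4_packet) < 20:
        return False
    version = ipv4_packet[0] >> 4
    protocol = ipv4_packet[9]
    destination = socket.inet_ntoa(ipv4_packet[16:20])
    return (version == 4 and protocol == CARP_PROTOCOL
            and destination == CARP_MULTICAST)


def build_header(src: str, dst: str, protocol: int) -> bytes:
    """Build a minimal 20-byte IPv4 header for testing (checksum left at 0)."""
    return struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20, 0, 0, 255, protocol, 0,
                       socket.inet_aton(src), socket.inet_aton(dst))


if __name__ == "__main__":
    carp = build_header("10.0.0.1", CARP_MULTICAST, CARP_PROTOCOL)
    pfsync = build_header("10.0.0.1", "10.0.0.2", PFSYNC_PROTOCOL)
    print(is_carp_advertisement(carp))    # True
    print(is_carp_advertisement(pfsync))  # False
```

A crossover cable on the sync interface carries the second kind of packet only; the first kind has to get through on every interface that holds a CARP address.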