CARP strange behaviour on all networks

PDJ

Anyone?
I really don't know what it could be. I didn't find much about this on the forums or on other pages.

PDJ

Do I have to report this as a bug?
Since upgrading to 2.1 we have had nothing but problems with the network. We have 14 networks and they all go down after a while.
Do we have to use a different password for every VHID?

PDJ

I have switched the backup server off, because it was very unstable.
So what should I do? How can I fix this problem?
Anybody?

Does it help to become a gold member?

ssheikh

What does your MBUF usage look like?

nothing

If I were you, I would disconnect all the networks and leave just the WAN and 1 LAN and if this works, start connecting the rest of LANs one by one to see when it fails.

PDJ

Thanks for the answer.

I have done that, but the problem is that with all networks connected it runs for a while and then suddenly collapses, sometimes after an hour, sometimes after a day.
Leaving all the networks disconnected for a day is not an option; that would mean downtime for a lot of services.

@ssheikh: good question, I'll check that.

PDJ

It has been a while. We decided to let it rest and disable CARP.

Now we have made a test network with the same hardware and I found out something very strange.
First of all, when the master goes down and comes back up, the slave won't hand the master role back.
When I run tcpdump on the slave I get:

IP 192.168.20.252 > 224.0.0.18: VRRPv2, Advertisement, vrid 5, prio 240, authtype none, intvl 1s, length 36

The funny thing is that the master is configured with skew 0 instead of 240. Where is that 240 coming from?

When I manually set the skew to 250 on the backup machine, I see it switch back to slave and the master becomes master.

But what is causing the strange unstable behaviour, and why is the prio set to 240?
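For anyone wanting to watch that value over time, here is a rough sketch of a filter that pulls the source address, vrid and prio out of tcpdump lines in the format shown above. The function name `carp_prio` and the interface name in the usage comment are just placeholders, and it assumes tcpdump keeps printing advertisements in exactly this layout.

```shell
# Hypothetical helper: reads tcpdump output on stdin and prints
# "<source> vrid=<n> prio=<n>" for every CARP/VRRP advertisement line.
# Lines that do not match the advertisement layout are dropped.
carp_prio() {
  sed -n 's/^.*IP \([0-9.]*\) > 224\.0\.0\.18: .*vrid \([0-9]*\), prio \([0-9]*\),.*$/\1 vrid=\2 prio=\3/p'
}

# Possible usage (em1 is an assumed interface name; 112 is the IP protocol
# number that CARP shares with VRRP):
#   tcpdump -l -n -i em1 ip proto 112 | carp_prio
```

Running it on both boxes at the same time should make it easier to see which side starts advertising 240 first.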

podilarius

Don't know. I checked mine and it is listed in tcpdump as:
[external IP] > 224.0.0.18: VRRPv2, Advertisement, vrid 124, prio 0, authtype none, intvl 1s, length 36, addrs(7): [removed to protect privacy]
It does this on all my CARP stuff. I am on 2.1 final, but all my configs are upgrades and not new installs.

Drop to console and report the output of this back.
grep -e advskew -e subnet /cf/conf/config.xml

PDJ

Thanks for the answer. I found more info: it has something to do with preempt. If one interface fails, the rest are set to advskew 240 so that all interfaces fail over together (that's not something I prefer, but since 2006 you can't change this; pfSense enables it by default).
However, in my case both boxes do the same. The result is that all interfaces have advskew 240 on both master and slave, and with 20 CARP networks this brings both boxes down because of the constant switching master -> backup -> master...

I have set net.inet.carp.preempt to 0 in the system tunables, but it is not changing.
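To see which VIPs are currently demoted on a box, something along these lines might help. It is only a sketch: it assumes ifconfig prints CARP state as lines like "carp: MASTER vhid 5 advbase 1 advskew 240", the function name `demoted_vhids` is made up, and the threshold of 240 is simply the value seen in the tcpdump output earlier in this thread.

```shell
# Hypothetical helper: reads ifconfig output on stdin and prints every
# CARP vhid whose advskew is 240 or higher (i.e. looks demoted).
demoted_vhids() {
  awk '/carp:/ {
    vhid = ""; skew = -1
    for (i = 1; i < NF; i++) {
      if ($i == "vhid")    vhid = $(i + 1)
      if ($i == "advskew") skew = $(i + 1) + 0
    }
    if (skew >= 240) print "vhid " vhid " advskew " skew
  }'
}

# Possible usage on each firewall:
#   sysctl net.inet.carp.preempt    # shows the value the kernel is actually using
#   ifconfig | demoted_vhids
```

If the sysctl still reports 1 after changing the tunable, the setting is probably being overridden somewhere else.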

podilarius

On your backup FW, do you have configuration sync turned on?
Personally, if one link fails, I need all of them to fail over. Mostly this is because I need to be able to bring down the master for maintenance, and also because if the WAN dies I don't want any LAN to go to the box where the WAN link failed. If it's one of the LANs, sure, it's not that big a deal; it will just go out the other WAN port. But you never know.

PDJ

For me it's easier to have only one network fail over. The setup is such that the slave doesn't have all features (no backup WAN connection), so only 1 network doesn't have failover when a network fails.
If all networks switch independently, I can still shut the master down; all its networks will go down and the slave will take over all of them.

I have created a stable situation again. I found out that when there is an open network (both pfSense boxes are set to INIT), the network becomes unstable within a couple of hours.

But I still want independent failover. I don't get why the option has been taken out.