CARP strange behaviour on all networks



  • I have a very strange problem with CARP.
    I have 2 firewalls with 14 networks (not all in use), all configured with CARP (schematic view attached).

    When I set up VIPs on the WAN side they work for about 15 minutes; after that the public IPs are no longer accessible from the outside (but they are still pingable from the inside).
    Yesterday I did a failover test on all networks and it worked fine: FW2 took over without problems. When FW1 came back up, some networks were master on FW1 and some on FW2.
    But after half an hour the complete network collapsed (all physical networks were affected). FW1 was not pingable, the CARP VIP address was not pingable, and FW2 was pingable but not reachable; everything was down. I tried to log in on the console of FW1, but it was very slow.
    When I rebooted FW1, FW2 took over and everything worked fine again, but 10 minutes after FW1 came up the network collapsed again. I then rebooted FW2; FW1 took over and the network was reachable again, and after FW2 came back up everything kept working. As long as FW1 stays master there is no problem, but when FW2 becomes master on a couple of networks, everything collapses.
    FW1 and FW2 are identical hardware-wise (the only difference is the CPU: FW1 has a Core 2 Duo 6420, FW2 a Core 2 Duo 6400).
    Network adapters are 4-port Intel NICs plus 2 onboard ports (Intel server board S5000).

    FW1 runs 2.1-RC1 and FW2 runs 2.1-RC2.
    The switches are mostly managed HP units plus a couple of unmanaged HP switches (which should forward the broadcast traffic just fine).
    We checked for loops; there are none, and all networks are separated.

    The IPs on the inside are mostly in the 192.168.x.x range, where FW1 has 192.168.x.253, FW2 has 192.168.x.252, and the VIP is 192.168.x.254. Almost all networks are /24, with a few exceptions.

    The VIPs on the WAN side never worked for more than 15 minutes (not with 2.1 final either).
    When I use IP aliases on FW1 it works fine (but I lose failover on those aliases).

    Does anyone have an idea?




  • I have some additional information.
    We had a new collapse: both FW1 and FW2 froze (they didn't respond to their keyboards). When I disconnected the sync interface, both boxes came back to life and the network started working again.

    So what could be going wrong here, such that both boxes completely froze, even on their consoles?

    Another thing I observed: when the slave switches to master (master unavailable), it won't switch back when the master becomes available again.



  • This sounds to me like you have some kind of packet storm.
    Are you using VLANs?
    Configured a bridge?
    Connected something somewhere to "save hardware"?



  • All answers are no:

    we do not have VLANs, no bridges are configured, and every subnet has its own physical Ethernet adapter (port).
    At the moment I have disabled CARP on the slave, because the network is unstable whenever it is enabled.
    I have to say we were on 2.0 before; we had some problems with the WAN VIPs, but otherwise it worked just fine. The real collapses started with 2.1.

    I checked the switches during the outage, but there were no very high loads on any port (on the ones I checked, at least; I don't have much time to check when all networks are down).

    This is what the logfile showed (on the master):
    Sep 13 14:13:01 kernel: opt12_vip12: link state changed to UP
    Sep 13 14:13:01 kernel: lan_vip1: link state changed to UP
    Sep 13 14:13:01 kernel: opt4_vip7: link state changed to UP
    Sep 13 14:13:01 kernel: opt6_vip9: link state changed to UP
    Sep 13 14:13:01 kernel: opt7_vip15: link state changed to UP
    Sep 13 14:13:01 kernel: opt2_vip6: link state changed to UP
    Sep 13 14:13:01 kernel: opt1_vip5: link state changed to UP
    Sep 13 14:13:01 kernel: wan_vip3: link state changed to UP
    Sep 13 14:13:01 kernel: opt11_vip11: link state changed to UP
    Sep 13 14:13:01 kernel: opt5_vip8: link state changed to UP
    Sep 13 14:13:34 kernel: opt7_vip15: MASTER -> BACKUP (more frequent advertisement received)
    Sep 13 14:13:34 kernel: opt7_vip15: link state changed to DOWN
    Sep 13 14:13:34 kernel: opt4_vip7: MASTER -> BACKUP (more frequent advertisement received)
    Sep 13 14:13:34 kernel: opt4_vip7: link state changed to DOWN
    Sep 13 14:13:34 kernel: opt6_vip9: MASTER -> BACKUP (more frequent advertisement received)
    Sep 13 14:13:34 kernel: opt6_vip9: link state changed to DOWN
    Sep 13 14:13:34 kernel: opt11_vip11: MASTER -> BACKUP (more frequent advertisement received)
    Sep 13 14:13:34 kernel: opt11_vip11: link state changed to DOWN
    Sep 13 14:13:34 kernel: opt2_vip6: MASTER -> BACKUP (more frequent advertisement received)
    Sep 13 14:13:34 kernel: opt1_vip5: MASTER -> BACKUP (more frequent advertisement received)
    Sep 13 14:13:34 kernel: opt2_vip6: link state changed to DOWN
    Sep 13 14:13:34 kernel: opt1_vip5: link state changed to DOWN
    Sep 13 14:13:34 kernel: wan_vip3: MASTER -> BACKUP (more frequent advertisement received)
    Sep 13 14:13:34 kernel: wan_vip3: link state changed to DOWN
    Sep 13 14:13:35 kernel: opt5_vip8: MASTER -> BACKUP (more frequent advertisement received)
    Sep 13 14:13:35 kernel: opt5_vip8: link state changed to DOWN
    Sep 13 14:24:28 kernel: opt5_vip8: link state changed to UP
    Sep 13 14:24:28 kernel: opt7_vip15: link state changed to UP
    Sep 13 14:24:29 kernel: opt4_vip7: link state changed to UP
    Sep 13 14:24:29 kernel: opt6_vip9: link state changed to UP
    Sep 13 14:24:29 kernel: opt11_vip11: link state changed to UP
    Sep 13 14:24:29 kernel: opt2_vip6: link state changed to UP
    Sep 13 14:24:29 kernel: opt1_vip5: link state changed to UP
    Sep 13 14:24:29 kernel: wan_vip3: link state changed to UP

    This is what the slave showed:
    Sep 13 14:12:59 kernel: opt1_vip5: link state changed to DOWN
    Sep 13 14:13:00 kernel: opt2_vip6: link state changed to DOWN
    Sep 13 14:13:01 kernel: opt4_vip7: link state changed to DOWN
    Sep 13 14:13:02 kernel: opt5_vip8: link state changed to DOWN
    Sep 13 14:13:03 kernel: opt6_vip9: link state changed to DOWN
    Sep 13 14:13:04 kernel: in_scrubprefix: err=65, prefix delete failed
    Sep 13 14:13:05 kernel: opt11_vip11: link state changed to DOWN
    Sep 13 14:13:05 kernel: in_scrubprefix: err=65, prefix delete failed
    Sep 13 14:13:06 kernel: opt12_vip12: link state changed to DOWN
    Sep 13 14:13:06 kernel: in_scrubprefix: err=65, prefix delete failed
    Sep 13 14:13:07 kernel: wan_vip3: link state changed to DOWN
    Sep 13 14:13:08 kernel: opt7_vip15: link state changed to DOWN
    Sep 13 14:13:09 kernel: lan_vip1: link state changed to DOWN
    Sep 13 14:13:22 kernel: carp0: changing name to 'opt1_vip5'
    Sep 13 14:13:22 kernel: opt1_vip5: INIT -> BACKUP
    Sep 13 14:13:22 kernel: opt1_vip5: link state changed to DOWN
    Sep 13 14:13:23 kernel: carp1: changing name to 'opt2_vip6'
    Sep 13 14:13:23 kernel: opt2_vip6: INIT -> BACKUP
    Sep 13 14:13:23 kernel: opt2_vip6: link state changed to DOWN
    Sep 13 14:13:24 kernel: carp2: changing name to 'opt4_vip7'
    Sep 13 14:13:24 kernel: opt4_vip7: INIT -> BACKUP
    Sep 13 14:13:24 kernel: opt4_vip7: link state changed to DOWN
    Sep 13 14:13:25 kernel: carp3: changing name to 'opt5_vip8'
    Sep 13 14:13:25 kernel: opt5_vip8: INIT -> BACKUP
    Sep 13 14:13:25 kernel: opt5_vip8: link state changed to DOWN
    Sep 13 14:13:25 kernel: opt1_vip5: link state changed to UP
    Sep 13 14:13:26 kernel: carp4: changing name to 'opt6_vip9'
    Sep 13 14:13:26 kernel: Restoring context for interface opt6_vip9 to 1(cpzone)
    Sep 13 14:13:26 kernel: opt6_vip9: INIT -> BACKUP
    Sep 13 14:13:26 kernel: opt6_vip9: link state changed to DOWN
    Sep 13 14:13:26 kernel: opt2_vip6: link state changed to UP
    Sep 13 14:13:27 kernel: carp5: changing name to 'opt10_vip10'
    Sep 13 14:13:27 kernel: ifa_del_loopback_route: deletion failed
    Sep 13 14:13:27 kernel: ifa_add_loopback_route: insertion failed
    Sep 13 14:13:27 kernel: opt4_vip7: link state changed to UP
    Sep 13 14:13:28 kernel: carp6: changing name to 'opt11_vip11'
    Sep 13 14:13:28 kernel: opt11_vip11: INIT -> BACKUP
    Sep 13 14:13:28 kernel: opt11_vip11: link state changed to DOWN
    Sep 13 14:13:28 kernel: opt5_vip8: link state changed to UP
    Sep 13 14:13:29 kernel: carp7: changing name to 'opt12_vip12'
    Sep 13 14:13:29 kernel: opt12_vip12: INIT -> BACKUP
    Sep 13 14:13:29 kernel: opt12_vip12: link state changed to DOWN
    Sep 13 14:13:29 kernel: opt6_vip9: link state changed to UP
    Sep 13 14:13:30 kernel: carp8: changing name to 'wan_vip3'
    Sep 13 14:13:30 kernel: wan_vip3: INIT -> BACKUP
    Sep 13 14:13:30 kernel: wan_vip3: link state changed to DOWN
    Sep 13 14:13:31 kernel: carp9: changing name to 'opt7_vip15'
    Sep 13 14:13:31 kernel: opt7_vip15: INIT -> BACKUP
    Sep 13 14:13:31 kernel: opt7_vip15: link state changed to DOWN
    Sep 13 14:13:31 kernel: opt11_vip11: link state changed to UP
    Sep 13 14:13:32 kernel: carp10: changing name to 'lan_vip1'
    Sep 13 14:13:32 kernel: lan_vip1: INIT -> BACKUP
    Sep 13 14:13:32 kernel: lan_vip1: link state changed to DOWN
    Sep 13 14:13:32 kernel: opt12_vip12: link state changed to UP
    Sep 13 14:13:33 kernel: wan_vip3: link state changed to UP
    Sep 13 14:13:34 php: /carp_status.php: waiting for pfsync…
    Sep 13 14:13:34 php: /carp_status.php: pfsync done in 0 seconds.
    Sep 13 14:13:34 php: /carp_status.php: Configuring CARP settings finalize...
    Sep 13 14:13:34 kernel: opt7_vip15: link state changed to UP
    Sep 13 14:13:35 kernel: opt12_vip12: MASTER -> BACKUP (more frequent advertisement received)
    Sep 13 14:13:35 kernel: opt12_vip12: link state changed to DOWN
    Sep 13 14:24:26 kernel: opt1_vip5: link state changed to DOWN
    Sep 13 14:24:27 kernel: opt2_vip6: link state changed to DOWN
    Sep 13 14:24:28 kernel: opt4_vip7: link state changed to DOWN
    Sep 13 14:24:28 kernel: opt12_vip12: link state changed to UP
    Sep 13 14:24:28 kernel: lan_vip1: link state changed to UP
    Sep 13 14:24:29 kernel: opt5_vip8: link state changed to DOWN
    Sep 13 14:24:30 kernel: opt6_vip9: link state changed to DOWN
    Sep 13 14:24:31 kernel: in_scrubprefix: err=65, prefix delete failed
    Sep 13 14:24:32 kernel: opt11_vip11: link state changed to DOWN
    Sep 13 14:24:33 kernel: opt12_vip12: link state changed to DOWN
    Sep 13 14:24:34 kernel: wan_vip3: link state changed to DOWN
    Sep 13 14:24:35 kernel: opt7_vip15: link state changed to DOWN
    Sep 13 14:24:36 kernel: lan_vip1: link state changed to DOWN



  • Anyone?
    I really don't know what it could be; I haven't found much about this on the forums or on other pages.



  • Do I have to report this as a bug?
    Since 2.1 we have had nothing but problems with the network; we have 14 networks and they all go down after a while.
    Do we have to use different passwords for every VHID?



  • I have switched the backup server off, because it was very unstable.
    So what should I do? How can I fix this problem?
    Anybody?

    Does it help to become a gold member?



  • What does your MBUF usage look like?
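    For what it's worth, mbuf usage can be checked from a pfSense/FreeBSD shell with standard tools; a quick sketch (nothing here is specific to this setup):

    ```shell
    # Show mbuf/cluster usage and any "requests denied" counters;
    # usage close to the limit suggests mbuf exhaustion
    netstat -m

    # The configured mbuf cluster limit (FreeBSD sysctl)
    sysctl kern.ipc.nmbclusters
    ```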



  • If I were you, I would disconnect all the networks and leave just the WAN and 1 LAN and if this works, start connecting the rest of LANs one by one to see when it fails.



  • Thanks for the answer.

    I have done that, but the problem is that with all networks connected it runs for a couple of hours and then suddenly collapses, sometimes after an hour, sometimes after a day.
    Leaving all the networks disconnected for a day is not an option; that would mean downtime on a lot of services.

    @ssheikh: good question, I'll check that.



  • It has been a while; we decided to let it rest and disabled CARP.

    Now we have built a test network with the same hardware, and I found something very strange.
    First of all, when the master goes down and comes back up, the slave won't switch back to backup.
    When I run a tcpdump on the slave, I get:

    IP 192.168.20.252 > 224.0.0.18: VRRPv2, Advertisement, vrid 5, prio 240, authtype none, intvl 1s, length 36

    The funny thing is that the master is configured with skew 0, not 240, so where is that 240 coming from?

    When I manually set the skew to 250 on the backup machine, I see it switch back to backup and the master becomes master again.

    But what is causing the strange unstable behaviour? And why is the prio set to 240?
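    For reference, CARP advertisements like the one above can be watched directly on the physical interface; a sketch using standard tcpdump syntax (the interface name `em0` is an assumption, substitute your own):

    ```shell
    # CARP shares IP protocol 112 with VRRP and advertises to multicast
    # 224.0.0.18; the "prio" field tcpdump prints is the sender's
    # effective advskew, so a value of 240 here means the sender has
    # been demoted even if its configured skew is 0.
    tcpdump -nvi em0 ip proto 112 and host 224.0.0.18
    ```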



  • Don't know. I checked mine and it is listed in tcpdump as:

    IP <external IP> > 224.0.0.18: VRRPv2, Advertisement, vrid 124, prio 0, authtype none, intvl 1s, length 36, addrs(7): <removed to protect privacy>

    It does this on all my CARP stuff. I am on 2.1 final, but all my configs are upgrades and not new installs.

    Drop to a console and report the output of this back:
    grep -e advskew -e subnet /cf/conf/config.xml



  • Thanks for the answer. I found more info: it has something to do with preempt. If one interface fails, the advskew of the rest is set to 240 so that all interfaces switch over. That's not the behaviour I prefer, but since 2006 you can't change it; pfSense has enabled it by default.
    However, in my case both boxes do the same thing, with the result that all interfaces get advskew 240 on both master and slave, and with 20 CARP networks that brings both boxes down because of the constant switching master -> backup -> master…

    I have set net.inet.carp.preempt to 0 in the system tunables, but it has no effect.
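    For anyone checking the same thing, the sysctl can also be inspected and set directly from a shell, which bypasses the question of whether the GUI tunable is applied; a minimal sketch (standard FreeBSD sysctl usage, a value set this way does not survive a reboot):

    ```shell
    # Read the current preemption setting (1 = preempt enabled)
    sysctl net.inet.carp.preempt

    # Disable preemption for the running system only; if the GUI
    # system tunable works, it should show 0 here after a reboot too
    sysctl net.inet.carp.preempt=0
    ```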



  • On your backup FW, do you have configuration-setting sync turned on?
    Personally, if one link fails, I would need all of them to fail over. Mostly this is because I need to be able to bring down the master for maintenance, and also because if the WAN dies I don't want any LAN to keep going to the box whose WAN link failed. If it's one of the LANs, sure, it's not that big a deal; traffic will just go out the other WAN port. But you never know.



  • For me it's easier to fail over per interface; the setup is such that the slave doesn't have all the features (no backup WAN connection), so only one network is without failover when there is a network failure.
    If all networks switch over independently, I can still shut the master down: all its networks go down and the slave takes over all of them.

    I have created a stable situation again. I found out that when there is an open network (both pfSense boxes set to INIT), the network becomes unstable within a couple of hours.

    But I still want to fail over independently, and I don't get why the option was taken out.

