CARP strange behaviour on all networks
-
This sounds to me like you have some kind of packet storm.
Are you using VLANs?
Configured a bridge?
Connected something somewhere to "save hardware"? -
All answers are no:
we do not have VLANs, no bridges are configured, and every subnet has its own physical Ethernet adapter (port).
At the moment I have disabled CARP on the slave, because the network is unstable when it is enabled.
I have to say we were on 2.0 before; we had some problems with the WAN VIPs, but it worked just fine. The real collapse started with 2.1.
I checked the switches during the outage, but there were no very high loads on any port (on the ones I checked; I don't have much time to check when all networks are down).
This is what the logfile showed (on the master):
Sep 13 14:13:01 kernel: opt12_vip12: link state changed to UP
Sep 13 14:13:01 kernel: lan_vip1: link state changed to UP
Sep 13 14:13:01 kernel: opt4_vip7: link state changed to UP
Sep 13 14:13:01 kernel: opt6_vip9: link state changed to UP
Sep 13 14:13:01 kernel: opt7_vip15: link state changed to UP
Sep 13 14:13:01 kernel: opt2_vip6: link state changed to UP
Sep 13 14:13:01 kernel: opt1_vip5: link state changed to UP
Sep 13 14:13:01 kernel: wan_vip3: link state changed to UP
Sep 13 14:13:01 kernel: opt11_vip11: link state changed to UP
Sep 13 14:13:01 kernel: opt5_vip8: link state changed to UP
Sep 13 14:13:34 kernel: opt7_vip15: MASTER -> BACKUP (more frequent advertisement received)
Sep 13 14:13:34 kernel: opt7_vip15: link state changed to DOWN
Sep 13 14:13:34 kernel: opt4_vip7: MASTER -> BACKUP (more frequent advertisement received)
Sep 13 14:13:34 kernel: opt4_vip7: link state changed to DOWN
Sep 13 14:13:34 kernel: opt6_vip9: MASTER -> BACKUP (more frequent advertisement received)
Sep 13 14:13:34 kernel: opt6_vip9: link state changed to DOWN
Sep 13 14:13:34 kernel: opt11_vip11: MASTER -> BACKUP (more frequent advertisement received)
Sep 13 14:13:34 kernel: opt11_vip11: link state changed to DOWN
Sep 13 14:13:34 kernel: opt2_vip6: MASTER -> BACKUP (more frequent advertisement received)
Sep 13 14:13:34 kernel: opt1_vip5: MASTER -> BACKUP (more frequent advertisement received)
Sep 13 14:13:34 kernel: opt2_vip6: link state changed to DOWN
Sep 13 14:13:34 kernel: opt1_vip5: link state changed to DOWN
Sep 13 14:13:34 kernel: wan_vip3: MASTER -> BACKUP (more frequent advertisement received)
Sep 13 14:13:34 kernel: wan_vip3: link state changed to DOWN
Sep 13 14:13:35 kernel: opt5_vip8: MASTER -> BACKUP (more frequent advertisement received)
Sep 13 14:13:35 kernel: opt5_vip8: link state changed to DOWN
Sep 13 14:24:28 kernel: opt5_vip8: link state changed to UP
Sep 13 14:24:28 kernel: opt7_vip15: link state changed to UP
Sep 13 14:24:29 kernel: opt4_vip7: link state changed to UP
Sep 13 14:24:29 kernel: opt6_vip9: link state changed to UP
Sep 13 14:24:29 kernel: opt11_vip11: link state changed to UP
Sep 13 14:24:29 kernel: opt2_vip6: link state changed to UP
Sep 13 14:24:29 kernel: opt1_vip5: link state changed to UP
Sep 13 14:24:29 kernel: wan_vip3: link state changed to UP

This is what the slave showed:
Sep 13 14:12:59 kernel: opt1_vip5: link state changed to DOWN
Sep 13 14:13:00 kernel: opt2_vip6: link state changed to DOWN
Sep 13 14:13:01 kernel: opt4_vip7: link state changed to DOWN
Sep 13 14:13:02 kernel: opt5_vip8: link state changed to DOWN
Sep 13 14:13:03 kernel: opt6_vip9: link state changed to DOWN
Sep 13 14:13:04 kernel: in_scrubprefix: err=65, prefix delete failed
Sep 13 14:13:05 kernel: opt11_vip11: link state changed to DOWN
Sep 13 14:13:05 kernel: in_scrubprefix: err=65, prefix delete failed
Sep 13 14:13:06 kernel: opt12_vip12: link state changed to DOWN
Sep 13 14:13:06 kernel: in_scrubprefix: err=65, prefix delete failed
Sep 13 14:13:07 kernel: wan_vip3: link state changed to DOWN
Sep 13 14:13:08 kernel: opt7_vip15: link state changed to DOWN
Sep 13 14:13:09 kernel: lan_vip1: link state changed to DOWN
Sep 13 14:13:22 kernel: carp0: changing name to 'opt1_vip5'
Sep 13 14:13:22 kernel: opt1_vip5: INIT -> BACKUP
Sep 13 14:13:22 kernel: opt1_vip5: link state changed to DOWN
Sep 13 14:13:23 kernel: carp1: changing name to 'opt2_vip6'
Sep 13 14:13:23 kernel: opt2_vip6: INIT -> BACKUP
Sep 13 14:13:23 kernel: opt2_vip6: link state changed to DOWN
Sep 13 14:13:24 kernel: carp2: changing name to 'opt4_vip7'
Sep 13 14:13:24 kernel: opt4_vip7: INIT -> BACKUP
Sep 13 14:13:24 kernel: opt4_vip7: link state changed to DOWN
Sep 13 14:13:25 kernel: carp3: changing name to 'opt5_vip8'
Sep 13 14:13:25 kernel: opt5_vip8: INIT -> BACKUP
Sep 13 14:13:25 kernel: opt5_vip8: link state changed to DOWN
Sep 13 14:13:25 kernel: opt1_vip5: link state changed to UP
Sep 13 14:13:26 kernel: carp4: changing name to 'opt6_vip9'
Sep 13 14:13:26 kernel: Restoring context for interface opt6_vip9 to 1(cpzone)
Sep 13 14:13:26 kernel: opt6_vip9: INIT -> BACKUP
Sep 13 14:13:26 kernel: opt6_vip9: link state changed to DOWN
Sep 13 14:13:26 kernel: opt2_vip6: link state changed to UP
Sep 13 14:13:27 kernel: carp5: changing name to 'opt10_vip10'
Sep 13 14:13:27 kernel: ifa_del_loopback_route: deletion failed
Sep 13 14:13:27 kernel: ifa_add_loopback_route: insertion failed
Sep 13 14:13:27 kernel: opt4_vip7: link state changed to UP
Sep 13 14:13:28 kernel: carp6: changing name to 'opt11_vip11'
Sep 13 14:13:28 kernel: opt11_vip11: INIT -> BACKUP
Sep 13 14:13:28 kernel: opt11_vip11: link state changed to DOWN
Sep 13 14:13:28 kernel: opt5_vip8: link state changed to UP
Sep 13 14:13:29 kernel: carp7: changing name to 'opt12_vip12'
Sep 13 14:13:29 kernel: opt12_vip12: INIT -> BACKUP
Sep 13 14:13:29 kernel: opt12_vip12: link state changed to DOWN
Sep 13 14:13:29 kernel: opt6_vip9: link state changed to UP
Sep 13 14:13:30 kernel: carp8: changing name to 'wan_vip3'
Sep 13 14:13:30 kernel: wan_vip3: INIT -> BACKUP
Sep 13 14:13:30 kernel: wan_vip3: link state changed to DOWN
Sep 13 14:13:31 kernel: carp9: changing name to 'opt7_vip15'
Sep 13 14:13:31 kernel: opt7_vip15: INIT -> BACKUP
Sep 13 14:13:31 kernel: opt7_vip15: link state changed to DOWN
Sep 13 14:13:31 kernel: opt11_vip11: link state changed to UP
Sep 13 14:13:32 kernel: carp10: changing name to 'lan_vip1'
Sep 13 14:13:32 kernel: lan_vip1: INIT -> BACKUP
Sep 13 14:13:32 kernel: lan_vip1: link state changed to DOWN
Sep 13 14:13:32 kernel: opt12_vip12: link state changed to UP
Sep 13 14:13:33 kernel: wan_vip3: link state changed to UP
Sep 13 14:13:34 php: /carp_status.php: waiting for pfsync…
Sep 13 14:13:34 php: /carp_status.php: pfsync done in 0 seconds.
Sep 13 14:13:34 php: /carp_status.php: Configuring CARP settings finalize...
Sep 13 14:13:34 kernel: opt7_vip15: link state changed to UP
Sep 13 14:13:35 kernel: opt12_vip12: MASTER -> BACKUP (more frequent advertisement received)
Sep 13 14:13:35 kernel: opt12_vip12: link state changed to DOWN
Sep 13 14:24:26 kernel: opt1_vip5: link state changed to DOWN
Sep 13 14:24:27 kernel: opt2_vip6: link state changed to DOWN
Sep 13 14:24:28 kernel: opt4_vip7: link state changed to DOWN
Sep 13 14:24:28 kernel: opt12_vip12: link state changed to UP
Sep 13 14:24:28 kernel: lan_vip1: link state changed to UP
Sep 13 14:24:29 kernel: opt5_vip8: link state changed to DOWN
Sep 13 14:24:30 kernel: opt6_vip9: link state changed to DOWN
Sep 13 14:24:31 kernel: in_scrubprefix: err=65, prefix delete failed
Sep 13 14:24:32 kernel: opt11_vip11: link state changed to DOWN
Sep 13 14:24:33 kernel: opt12_vip12: link state changed to DOWN
Sep 13 14:24:34 kernel: wan_vip3: link state changed to DOWN
Sep 13 14:24:35 kernel: opt7_vip15: link state changed to DOWN
Sep 13 14:24:36 kernel: lan_vip1: link state changed to DOWN
-
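When that many VHIDs flap at once, it helps to tally the demotions per interface before digging further. This is a sketch (the `count_demotions` helper name is mine); it reads syslog-style lines like the ones above on stdin, so you can pipe your system log into it:

```shell
# Count "MASTER -> BACKUP" demotions per CARP interface from syslog lines.
count_demotions() {
  grep 'MASTER -> BACKUP' \
    | sed 's/.*kernel: \([^:]*\): MASTER.*/\1/' \
    | sort | uniq -c | sort -rn
}

# Example with two of the lines from the log above:
printf '%s\n' \
  'Sep 13 14:13:34 kernel: opt7_vip15: MASTER -> BACKUP (more frequent advertisement received)' \
  'Sep 13 14:13:35 kernel: opt12_vip12: MASTER -> BACKUP (more frequent advertisement received)' \
  | count_demotions
```

If every interface shows roughly the same count, the demotions are global (pointing at preemption) rather than one bad link.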
Anyone?
I really don't know what it could be; I didn't find much about this on the forums or on other pages. -
Do I have to report this as a bug?
Since 2.1 we have had nothing but problems with the network; we have 14 networks and they all go down after a while.
Do we have to use different passwords for every VHID? -
I have switched the backup server off, because it was very unstable.
So what should I do? How can I fix this problem?
Anybody? Does it help to become a gold member?
-
What does your MBUF usage look like?
-
If I were you, I would disconnect all the networks and leave just the WAN and one LAN; if this works, start connecting the rest of the LANs one by one to see when it fails.
-
Thanks for the answer.
I have done that, but the problem is that with all networks connected it runs for a couple of hours and then suddenly collapses; sometimes after an hour, sometimes after a day.
Leaving all the networks disconnected for a day is not an option; that would mean downtime on a lot of services.
@ssheikh: good question, I'll check that.
-
It has been a while; we decided to let it rest for a while and disabled CARP.
Now we have built a test network with the same hardware, and I found out something very strange.
First of all, when the master goes down and comes back up, the slave won't hand the master role back.
When I run a tcpdump on the slave I get:
IP 192.168.20.252 > 224.0.0.18: VRRPv2, Advertisement, vrid 5, prio 240, authtype none, intvl 1s, length 36
The funny thing is that the master is configured with skew 0 instead of 240, so where is that 240 coming from?
When I manually set the skew to 250 on the backup machine, I see it switch back to backup and the master becomes master again.
But what is causing the strange unstable behaviour? And why is the prio set to 240?
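For context on that "prio" field: for CARP traffic, tcpdump's VRRP decoder shows the advskew byte there, and CARP elects whichever node advertises most frequently. Per carp(4), the advertisement interval is advbase + advskew/256 seconds, so a lower skew wins. A quick sketch to compare intervals (the `carp_interval` helper is hypothetical, not a pfSense command; it just evaluates the carp(4) formula):

```shell
# carp_interval ADVBASE ADVSKEW
# Prints the CARP advertisement interval in seconds: advbase + advskew/256.
carp_interval() {
  awk -v b="$1" -v s="$2" 'BEGIN { printf "%.4f\n", b + s/256 }'
}

carp_interval 1 0    # intended master, skew 0
carp_interval 1 240  # a demoted node advertising with skew 240
carp_interval 1 250  # backup with manually raised skew 250
```

A node advertising with skew 240 (1.9375 s) still beats one at 250 (1.9766 s), which would explain why raising the backup's skew to 250 let the demoted master win the election again.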
-
Don't know. I checked mine and tcpdump lists it as:
<external ip> > 224.0.0.18: VRRPv2, Advertisement, vrid 124, prio 0, authtype none, intvl 1s, length 36, addrs(7): <removed to protect privacy>
It does this on all my CARP stuff. I am on 2.1 final, but all my configs are upgrades, not new installs.
Drop to console and report the output of this back:
grep -e advskew -e subnet /cf/conf/config.xml -
Thanks for the answer. I found more info: it has something to do with preempt. If one interface fails, the rest are set to advskew 240 so that all interfaces fail over (that's not the behaviour I prefer, but since 2006 you can't change this; pfSense has enabled it by default).
However, in my case both boxes do the same, with the result that all interfaces have advskew 240 on both master and slave, and with 20 CARP networks the constant switching master -> backup -> master brings both boxes down. I have set net.inet.carp.preempt to 0 in the System Tunables, but it is not changing anything.
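If preemption is the trigger, the live value can be checked from a shell on both boxes. A sketch, assuming console or SSH access; note that a value set only via sysctl is not persistent, and pfSense's own rc code or a System Tunables entry may set it back at the next sync or reboot:

```shell
# 1 = when any CARP interface loses link, demote ALL VHIDs so everything
# fails over together; 0 = interfaces fail over independently.
sysctl net.inet.carp.preempt

# Change it for the running kernel only (not persistent across reboots;
# verify the System Tunables entry as well):
sysctl net.inet.carp.preempt=0

# Inspect the effective advbase/advskew on one CARP interface
# (opt1_vip5 is an interface name taken from the logs above):
ifconfig opt1_vip5 | grep -i carp
```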
-
On your backup FW, do you have configuration sync turned on?
Personally, if I have one link fail, I would need all of them to fail over. Mostly this is because I will need to bring down the master for maintenance. Also because if the WAN died, I wouldn't want any LAN to stay on the box where the WAN link failed. If it's one of the LANs, sure, it's not that big a deal; traffic will just go out the other WAN port. But you never know. -
For me it's easier to have only one network fail over; the setup is such that the slave doesn't have all the features (no backup WAN connection), so only one network lacks failover when there is a network failure.
If all networks failed over independently, I could still take the master down: all its networks would go down and the slave would take over all of them.
I have created a stable situation again; I found out that when there is an open network (both pfSense boxes set to INIT), the network becomes unstable within a couple of hours.
But I still want to fail over independently; I don't get why the option was taken out.