CARP Problems

sullrich

It stays at 200 for 60-90 seconds on bootup then switches back.

NAmorim

It stays with advskew 200 all the time.

Hi yall.

I've spent a couple of hours wondering why CARP isn't working on two boxes.
After creating the VIPs, their advskew values are different, interfaces don't come up, etc.

Then I checked "/etc/inc/interfaces.inc" (I'm using 1.0BETA2, and I think this applies to other versions too).
I think this is the reason (lines 408 and 409):
fwrite($fd, "/sbin/ifconfig carp" . $carp_instances_counter . " " . $vip['subnet'] . "/" . $vip['subnet_bits'] . " broadcast " . $broadcast_address . " vhid " . $vip['vhid'] . "{$carpdev} advskew 200 " . $password . "\n");
mwexec("/sbin/ifconfig carp" . $carp_instances_counter . " " . $vip['subnet'] . "/" . $vip['subnet_bits'] . " broadcast " . $broadcast_address . " vhid " . $vip['vhid'] . "{$carpdev} advskew 200 " . $password);

The advskew is hard-coded to 200, no matter what is in the configuration (/conf/config.xml).

So you can edit /etc/inc/interfaces.inc, go to lines 408 and 409, and change them to this:

fwrite($fd, "/sbin/ifconfig carp" . $carp_instances_counter . " " . $vip['subnet'] . "/" . $vip['subnet_bits'] . " broadcast " . $broadcast_address . " vhid " . $vip['vhid'] . "{$carpdev} advskew " . $vip['advskew'] . " " . $password . "\n");
mwexec("/sbin/ifconfig carp" . $carp_instances_counter . " " . $vip['subnet'] . "/" . $vip['subnet_bits'] . " broadcast " . $broadcast_address . " vhid " . $vip['vhid'] . "{$carpdev} advskew " . $vip['advskew'] . " " . $password);

In the previous version I had this changed. I thought that it was already in cvs, but only the sleep issue was changed.
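
For readers who don't use PHP, here is a minimal Python sketch of the same idea: take the skew from the VIP's configuration instead of a literal 200. The function name, the dict keys and the sample values are made up for illustration (the keys just mirror the `$vip` array above); this is not pfSense code.

```python
def build_carp_ifconfig(index, vip, broadcast, carpdev="", password=""):
    """Build the ifconfig command line for one CARP VIP.

    `vip` is a dict with hypothetical keys mirroring the PHP array in
    this thread: subnet, subnet_bits, vhid, advskew.
    `carpdev` and `password` are passed through as pre-formatted
    option strings, as in the PHP snippet.
    """
    parts = [
        f"/sbin/ifconfig carp{index}",
        f"{vip['subnet']}/{vip['subnet_bits']}",
        f"broadcast {broadcast}",
        f"vhid {vip['vhid']}",
    ]
    if carpdev:
        parts.append(carpdev)
    # The point of the fix: use the configured skew, not a constant.
    parts.append(f"advskew {vip['advskew']}")
    if password:
        parts.append(password)
    return " ".join(parts)


if __name__ == "__main__":
    # Illustrative values only.
    example_vip = {"subnet": "10.1.1.10", "subnet_bits": "16",
                   "vhid": "1", "advskew": "0"}
    print(build_carp_ifconfig(0, example_vip, "10.1.255.255",
                              password="pass secret"))
```

With this shape, a master configured with advskew 0 and a backup configured with 200 end up with different commands, which is what the hard-coded version prevents.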

sullrich

Then you have a configuration issue. Check these points:

Make sure you have a static address on each of the pfsync interfaces, in the same subnet
Try pinging the other end of pfsync to ensure connectivity (if this doesn't work, then stop here and double-check everything)
Make sure each CARP IP has the same VHID on every cluster member
Make sure each CARP pair has the same password
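
The VHID and password checks in this list can be done mechanically once you have the VIP settings of both boxes side by side. A rough Python sketch, assuming you have copied each box's VIPs into a list of dicts by hand (the keys `ip`, `vhid` and `password` and the function name are hypothetical, not a pfSense format):

```python
def find_vip_mismatches(primary, secondary):
    """Compare the CARP VIPs of two cluster members.

    Each argument is a list of dicts with the hypothetical keys
    'ip', 'vhid' and 'password'. Returns a list of human-readable
    problem descriptions; an empty list means the two sides agree.
    """
    problems = []
    secondary_by_ip = {vip["ip"]: vip for vip in secondary}

    for vip in primary:
        other = secondary_by_ip.get(vip["ip"])
        if other is None:
            problems.append(f"{vip['ip']}: not configured on the other box")
            continue
        # Same IP must use the same VHID on both machines.
        if vip["vhid"] != other["vhid"]:
            problems.append(
                f"{vip['ip']}: vhid {vip['vhid']} here, {other['vhid']} there")
        # Same IP must use the same password on both machines.
        if vip["password"] != other["password"]:
            problems.append(f"{vip['ip']}: passwords differ")
    return problems


if __name__ == "__main__":
    box_a = [{"ip": "10.1.1.10", "vhid": 1, "password": "x"}]
    box_b = [{"ip": "10.1.1.10", "vhid": 2, "password": "x"}]
    for line in find_vip_mismatches(box_a, box_b):
        print(line)
```

It only covers the configuration side; the pfsync ping test still has to be done on the boxes themselves.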

ane

I have checked all of the above, but…

        Master
       ___________   ~~~~~
       |     sis2|----DMZ
---WAN-|sis1     |   ~~~~~
   |   |         |                     ~~~~~
   |   |_____sis0|----LAN---------------LAN
   |                   |               ~~~~~
   |                   |           ~~~~~
   |                   |___VLAN0 - pfsync
   |                   |           ~~~~~
   |                   |           
   |                   |           ~~~~~
   |                   |___VLAN1 - WLAN
   |    Backup                     ~~~~~
   |
   |   ___________   ~~~~~
   |   |     sis2|----DMZ
---WAN-|sis1     |   ~~~~~
       |         |                     ~~~~~
       |_____sis0|----LAN---------------LAN
                      |                ~~~~~
                      |           ~~~~~
                      |___VLAN0 - pfsync
                      |           ~~~~~
                      |           
                      |          ~~~~~
                      |___VLAN1 - WLAN
                                 ~~~~~

I configured CARP-VIPs for the DMZ, LAN and WLAN-vlan.

Now I have the same phenomenon as described before:
the boxes keep changing Master/Slave on DMZ and LAN, the backup box being Master most of the time.

On the vlan however, both insist on being master. tcpdump on LAN shows the same strangeness in changing advskew.

I have * * * * * rules for all non-WAN interfaces.

Edit: Here's the ifconfig output of the Master


ifconfig 
sis0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
        options=8<VLAN_MTU>
        inet6 fe80::20d:b9ff:fe02:7a8c%sis0 prefixlen 64 scopeid 0x1
        inet 10.1.1.1 netmask 0xffff0000 broadcast 10.1.255.255
        ether 00:0d:b9:02:7a:8c
        media: Ethernet autoselect (100baseTX <full-duplex>)
        status: active
sis1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        options=8<VLAN_MTU>
        inet6 fe80::20d:b9ff:fe02:7a8d%sis1 prefixlen 64 scopeid 0x2
        ether 00:0d:b9:02:7a:8d
        media: Ethernet autoselect (100baseTX <full-duplex>)
        status: active
sis2: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
        options=8<VLAN_MTU>
        inet 10.5.1.1 netmask 0xffff0000 broadcast 10.5.255.255
        inet6 fe80::20d:b9ff:fe02:7a8e%sis2 prefixlen 64 scopeid 0x3
        ether 00:0d:b9:02:7a:8e
        media: Ethernet autoselect (100baseTX <full-duplex>)
        status: active
pfsync0: flags=41<UP,RUNNING> mtu 1348
        pfsync: syncdev: vlan0 maxupd: 128
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
        inet 127.0.0.1 netmask 0xff000000
        inet6 ::1 prefixlen 128
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x5
pflog0: flags=100<PROMISC> mtu 33208
vlan0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        inet 192.168.254.1 netmask 0xffffff00 broadcast 192.168.254.255
        inet6 fe80::20d:b9ff:fe02:7a8c%vlan0 prefixlen 64 scopeid 0x7
        ether 00:0d:b9:02:7a:8c
        media: Ethernet autoselect (100baseTX <full-duplex>)
        status: active
        vlan: 30 parent interface: sis0
vlan1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
        inet 10.4.1.1 netmask 0xffff0000 broadcast 10.4.255.255
        inet6 fe80::20d:b9ff:fe02:7a8c%vlan1 prefixlen 64 scopeid 0x8
        ether 00:0d:b9:02:7a:8c
        media: Ethernet autoselect (100baseTX <full-duplex>)
        status: active
        vlan: 4 parent interface: sis0
ng0: flags=88d1<UP,POINTOPOINT,RUNNING,NOARP,SIMPLEX,MULTICAST> mtu 1492
        inet6 fe80::20d:b9ff:fe02:7a8c%ng0 prefixlen 64 scopeid 0x9
        inet 80.136.201.83 --> 217.0.116.148 netmask 0xffffffff
carp0: flags=49<UP,LOOPBACK,RUNNING> mtu 1500
        inet 10.1.1.10 netmask 0xffff0000
        carp: BACKUP vhid 1 advbase 1 advskew 200
carp1: flags=49<UP,LOOPBACK,RUNNING> mtu 1500
        inet 10.4.1.10 netmask 0xffff0000
        carp: BACKUP vhid 4 advbase 1 advskew 200
carp2: flags=49<UP,LOOPBACK,RUNNING> mtu 1500
        inet 10.5.1.10 netmask 0xffff0000
        carp: MASTER vhid 5 advbase 1 advskew 200

ifconfig on Backup


sis0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
        options=8<VLAN_MTU>
        inet6 fe80::20d:b9ff:fe02:8094%sis0 prefixlen 64 scopeid 0x1
        inet 10.1.1.5 netmask 0xffff0000 broadcast 10.1.255.255
        ether 00:0d:b9:02:80:94
        media: Ethernet autoselect (100baseTX <full-duplex>)
        status: active
sis1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        options=8<VLAN_MTU>
        inet6 fe80::20d:b9ff:fe02:8095%sis1 prefixlen 64 scopeid 0x2
        ether 00:0d:b9:02:80:95
        media: Ethernet autoselect (100baseTX <full-duplex>)
        status: active
sis2: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
        options=8<VLAN_MTU>
        inet 10.5.1.5 netmask 0xffff0000 broadcast 10.5.255.255
        inet6 fe80::20d:b9ff:fe02:8096%sis2 prefixlen 64 scopeid 0x3
        ether 00:0d:b9:02:80:96
        media: Ethernet autoselect (100baseTX <full-duplex>)
        status: active
pfsync0: flags=41<UP,RUNNING> mtu 1348
        pfsync: syncdev: vlan0 maxupd: 128
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384
        inet 127.0.0.1 netmask 0xff000000
        inet6 ::1 prefixlen 128
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x5
pflog0: flags=100<PROMISC> mtu 33208
vlan0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> mtu 1500
        inet 192.168.254.2 netmask 0xffffff00 broadcast 192.168.254.255
        inet6 fe80::20d:b9ff:fe02:8094%vlan0 prefixlen 64 scopeid 0x7
        ether 00:0d:b9:02:80:94
        media: Ethernet autoselect (100baseTX <full-duplex>)
        status: active
        vlan: 30 parent interface: sis0
vlan1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
        inet 10.4.1.5 netmask 0xffff0000 broadcast 10.4.255.255
        inet6 fe80::20d:b9ff:fe02:8094%vlan1 prefixlen 64 scopeid 0x8
        ether 00:0d:b9:02:80:94
        media: Ethernet autoselect (100baseTX <full-duplex>)
        status: active
        vlan: 4 parent interface: sis0
ng0: flags=8890<POINTOPOINT,NOARP,SIMPLEX,MULTICAST> mtu 1500
carp0: flags=49<UP,LOOPBACK,RUNNING> mtu 1500
        inet 10.1.1.10 netmask 0xffff0000
        carp: MASTER vhid 1 advbase 1 advskew 200
carp1: flags=49<UP,LOOPBACK,RUNNING> mtu 1500
        inet 10.4.1.10 netmask 0xffff0000
        carp: MASTER vhid 4 advbase 1 advskew 200
carp2: flags=49<UP,LOOPBACK,RUNNING> mtu 1500
        inet 10.5.1.10 netmask 0xffff0000
        carp: MASTER vhid 5 advbase 1 advskew 200
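
Comparing two ifconfig listings like these by eye is error-prone, so here is a small Python sketch that pulls out the `carp:` lines and reports every VHID that is MASTER on both boxes. The function names are made up, and the regular expression only assumes the `carp: STATE vhid N advbase N advskew N` layout seen in the listings above.

```python
import re

# Matches lines such as: "carp: MASTER vhid 5 advbase 1 advskew 200"
CARP_LINE = re.compile(
    r"carp:\s+(MASTER|BACKUP|INIT)\s+vhid\s+(\d+)"
    r"\s+advbase\s+(\d+)\s+advskew\s+(\d+)"
)


def parse_carp_states(ifconfig_text):
    """Return {vhid: {'state', 'advbase', 'advskew'}} from ifconfig output."""
    states = {}
    for line in ifconfig_text.splitlines():
        match = CARP_LINE.search(line)
        if match:
            state, vhid, advbase, advskew = match.groups()
            states[int(vhid)] = {
                "state": state,
                "advbase": int(advbase),
                "advskew": int(advskew),
            }
    return states


def dual_masters(ifconfig_a, ifconfig_b):
    """Return the sorted VHIDs that report MASTER on both boxes."""
    a = parse_carp_states(ifconfig_a)
    b = parse_carp_states(ifconfig_b)
    return sorted(
        vhid for vhid in a
        if vhid in b
        and a[vhid]["state"] == "MASTER"
        and b[vhid]["state"] == "MASTER"
    )
```

Feeding it the two outputs saved to text files would list the VHIDs where both sides claim to be master; it says nothing about why, only where to look.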

I can ping the DMZ interface from Master to Backup, but not vice versa.

tcpdump on LAN:

23:32:04.572009 IP Backup > vrrp.mcast.net: VRRPv2, Advertisement, vrid 1, prio 20, authtype none, intvl 1s, length 36
23:32:05.698596 IP Backup > vrrp.mcast.net: VRRPv2, Advertisement, vrid 1, prio 20, authtype none, intvl 1s, length 36
23:32:06.824884 IP Backup > vrrp.mcast.net: VRRPv2, Advertisement, vrid 1, prio 20, authtype none, intvl 1s, length 36
23:32:10.613710 IP master > vrrp.mcast.net: VRRPv2, Advertisement, vrid 1, prio 240, authtype none, intvl 1s, length 36
23:32:12.354547 IP master > vrrp.mcast.net: VRRPv2, Advertisement, vrid 1, prio 240, authtype none, intvl 1s, length 36
23:32:14.300326 IP master > vrrp.mcast.net: VRRPv2, Advertisement, vrid 1, prio 240, authtype none, intvl 1s, length 36
….
...
23:35:17.600611 IP master > vrrp.mcast.net: VRRPv2, Advertisement, vrid 1, prio 240, authtype none, intvl 1s, length 36
23:35:19.546316 IP master > vrrp.mcast.net: VRRPv2, Advertisement, vrid 1, prio 240, authtype none, intvl 1s, length 36
23:35:21.492071 IP master > vrrp.mcast.net: VRRPv2, Advertisement, vrid 1, prio 240, authtype none, intvl 1s, length 36
23:35:21.492303 IP Backup > vrrp.mcast.net: VRRPv2, Advertisement, vrid 1, prio 200, authtype none, intvl 1s, length 36
23:35:23.335285 IP Backup > vrrp.mcast.net: VRRPv2, Advertisement, vrid 1, prio 200, authtype none, intvl 1s, length 36
23:35:25.076075 IP Backup > vrrp.mcast.net: VRRPv2, Advertisement, vrid 1, prio 200, authtype none, intvl 1s, length 36
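
To make the changing values in a capture like this easier to spot, the lines can be parsed and each sender's `prio` tracked over time. A Python sketch follows; the function names are made up and the regular expression only targets the line format shown above. Note that tcpdump decodes CARP as VRRPv2 here, and reading `prio` as the CARP advskew is an assumption based on the two headers sharing that byte position.

```python
import re

# Matches lines such as:
# 23:32:04.572009 IP Backup > vrrp.mcast.net: VRRPv2, Advertisement, vrid 1, prio 20, ...
ADV_LINE = re.compile(
    r"^(\d{2}):(\d{2}):(\d{2}\.\d+) IP (\S+) > \S+: "
    r"VRRPv2, Advertisement, vrid (\d+), prio (\d+)"
)


def parse_advertisements(capture_text):
    """Return a list of (seconds_since_midnight, sender, vrid, prio)."""
    rows = []
    for line in capture_text.splitlines():
        match = ADV_LINE.match(line.strip())
        if not match:
            continue  # skip elided or unrelated lines
        hours, minutes, seconds, sender, vrid, prio = match.groups()
        timestamp = int(hours) * 3600 + int(minutes) * 60 + float(seconds)
        rows.append((timestamp, sender, int(vrid), int(prio)))
    return rows


def prio_changes(rows):
    """Return (sender, vrid, old_prio, new_prio) for every change seen."""
    last_seen = {}
    changes = []
    for _, sender, vrid, prio in rows:
        key = (sender, vrid)
        if key in last_seen and last_seen[key] != prio:
            changes.append((sender, vrid, last_seen[key], prio))
        last_seen[key] = prio
    return changes
```

A sender whose value keeps changing between advertisements, as in the capture above, is the thing to chase.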

Setup is currently BETA4

Juve

I have the same problem (brand new install of beta4).
After configuring CARP on each firewall, some of the interfaces of the master are in backup mode and some others in master mode; the same happens on the slave firewall.

If I do an ifconfig carp0, carp1, etc., I can see that the advskew is set to 200 on all carp interfaces on both firewalls, even though I have set 0 on the master. Backing up the configuration and looking at the XML file shows the right configuration (0 for master VIPs and 200 for slave).

Then if I modify /tmp/carp.sh on the master by setting the advskew to 0, destroy all carp interfaces and execute carp.sh, all is fine because master is master!

If I modify the code where the advskew is hard-coded on the master firewall, then all is fine too.

sullrich

It will have an advertising skew until the final carp bringup process (about 2 minutes after the firewall is completely booted up). You can view the progress on the console.

As for interfaces being master or backup when they shouldn't be, this means that carp is not communicating on the interfaces themselves. It needs to be able to broadcast and talk to the other firewall on the interface in question.

iimre

@sullrich:

As for interfaces being master or backup when they shouldn't be, this means that carp is not communicating on the interfaces themselves. It needs to be able to broadcast and talk to the other firewall on the interface in question.

How could I test it? I'm facing a similar problem: one of my four carp interfaces is "master-master" no matter what I do. A simple ping goes fine to and fro. Nothing seems to be blocked in the logs. I have already changed NICs and switches without success.

sullrich

If you have not seen the CARP tutorial on our site then you need to follow it. It will guide you in setting up the primary box, which syncs the configuration to the secondaries. This is important because it ensures that the advskew and the vhid are correct across all cluster members. It also ensures that the passwords match per vhid. Place a crossover cable between the two WAN interfaces. Does the problem persist? If so, you have a mismatched configuration somewhere.

iimre

@sullrich:

If you have not seen the CARP tutorial on our site then you need to follow it.

I did exactly that.

It will guide you in setting up the primary box, which syncs the configuration to the secondaries. This is important because it ensures that the advskew and the vhid are correct across all cluster members. It also ensures that the passwords match per vhid. Place a crossover cable between the two WAN interfaces.

I have already tried this. Not only the WAN but all the interface pairs, one by one. I will make some more xover cables tomorrow and will try connecting all interface pairs (WAN, WAN2, DMZ and LAN) with xover (the carp synchronization interface is of course permanently xovered).

Does the problem persist?

Yes :(

If so you have a mismatched configuration somewhere.

Yes, probably, but I have tried to build it up several times from scratch, with only (as I guess) the minimal necessary configuration. So now I have no idea what the problem could be.
Anyhow, it seems to function well on both WAN interfaces, from either LAN or DMZ, but I'm afraid that there is a hidden problem which could cause a collapse at the worst moment.

sullrich

Post screenshots of each machine's virtual IP configuration so we can inspect them.

iimre

I attached them as you asked. I reduced the sizes as much as possible, hoping that they are still readable.
Thank you for your help

Imre

pfsense1__dmz-carp.jpg_thumb
pfsense1__lan-carp.jpg_thumb
pfsense1__wan2-carp.jpg_thumb
pfsense1__wan-carp.jpg_thumb
pfsense2__dmz-carp.jpg_thumb
pfsense2__lan-carp.jpg_thumb
pfsense2__wan2-carp.jpg_thumb
pfsense2__wan-carp.jpg_thumb

sullrich

Each of the same IPs needs to share the same vhid group… They are unique in your setup, which also tells me that you didn't follow the tutorial, as it would have sync'd the configuration to the backup node, ensuring this is all the way it should be. >:(

iimre

Sorry, then I am probably misunderstanding something :(
xxx.xxx.xxx.165's VHID=1
xxx.xxx.xxx.116's VHID=2
10.0.254.4's VHID=3
192.168.0.10's VHID=4
The same kind of interfaces have the same vhid group number.
I'm confused. Should all 4 have the same one?

sullrich

Each unique IP needs to have its own VHID. The VHID needs to match on each machine.

If you are using the Sync option as the tutorial shows, this is all automatic.

iimre

@sullrich:

Each unique IP needs to have its own VHID.

It is.

The VHID needs to match on each machine.

They do.

If you are using the Sync option as the tutorial shows, this is all automatic.

I did, and they look the same to me, but please let me know which one is not matching. It is probably my fault, but I really don't see it.

Juve

I just want to add something worth knowing before activating sync over XML-RPC. With a lot of rules in the filter, rule sync over XML-RPC is not practical (in terms of usability). I have tested it on a cluster which has between 700 and 800 rules… when you modify one thing the sync starts and the firewall goes to 100% CPU (php process) for many minutes, losing control of everything. This was tested on 2 IBM x336 machines, Intel Xeon 3.2GHz dual core with 2GB of RAM and 80GB SATA hard drives.

What I do is manual sync using partial backups ;-) and that's fine, I'm not adding rules every minute ;-)

sullrich

I don't really want to hijack this thread, but could you please start a new topic that explains the pain and frustration of managing such a large ruleset? We can begin to brainstorm how to improve this situation.

Juve

I really hope you don't think I'm complaining. The previous post was just a sort of "advice" for those who have not tried it yet.

Regards.

sullrich

Not at all. I can just imagine that managing that many rules must be painful. I am looking for information on what you don't like, what is hard to do, etc. for future improvements…

iimre

Hi,

Just for the record, my problem is solved. It was a rule mistake on DMZ: I had directed all traffic destined for anywhere other than LAN or DMZ to the load balancer (WAN1 + WAN2), so the traffic to 224.0.0.x went out to the net as well.
Thanks to all who tried to help me solve this problem.
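
For anyone hitting the same thing, the mistake can be shown with a toy model of the rule. This is only a Python sketch of the matching logic, not how pf or pfSense evaluates rules; the function name, the subnets and the return labels are made up. CARP advertisements go to 224.0.0.18 (vrrp.mcast.net in the capture earlier in the thread), which falls under a "not LAN, not DMZ" match unless link-local multicast is exempted first.

```python
import ipaddress

# Link-local multicast block that contains the CARP/VRRP group 224.0.0.18.
LOCAL_MULTICAST = ipaddress.ip_network("224.0.0.0/24")


def route_decision(destination, local_networks, exempt_local_multicast):
    """Decide where a toy policy-routing rule would send a packet.

    Returns 'local' if the packet stays on the segment, or
    'load_balancer' if the catch-all rule grabs it.
    """
    address = ipaddress.ip_address(destination)
    if exempt_local_multicast and address in LOCAL_MULTICAST:
        return "local"
    for network in local_networks:
        if address in ipaddress.ip_network(network):
            return "local"
    # Everything that is not a local network goes to the balancer.
    return "load_balancer"


if __name__ == "__main__":
    nets = ["192.168.0.0/24", "10.0.254.0/24"]  # illustrative subnets
    print(route_decision("224.0.0.18", nets, exempt_local_multicast=False))
    print(route_decision("224.0.0.18", nets, exempt_local_multicast=True))
```

Without the exemption the advertisement is sent to the balancer and never reaches the other firewall on that segment, which would fit the master-master symptom described above.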