State table not synced?

bmaster

We have two pfSense nodes, with CARP installed and working. Each node has a LAN and a WAN interface, plus a third sync interface, configured as described in the pfSense book. Configuration changes (firewall rules, traffic shaper, …) are synced to the backup node, and failover seems to work. However, when I start a TCP connection from my PC to somewhere on the internet and then force a failover (by unplugging the network cables of the master node), the TCP connection is dropped. When I compare the state tables of the two nodes (Diagnostics -> States), I see that they are nowhere near identical. What am I doing wrong? Thanks in advance!

PS: on the slave node, there's nothing checked or filled in under 'carp settings'. Is that ok?
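
For anyone who wants to compare the two state tables more systematically than by eye, here is a minimal sketch. It assumes you have saved a text dump of the states from each node (for example the output of `pfctl -s states`) and simply treats every line as an opaque string. The function names and the sample lines are made up for illustration; they are not taken from these boxes.

```python
def load_states(dump_text):
    """Return the set of non-empty, whitespace-normalised lines in a state dump."""
    return {" ".join(line.split()) for line in dump_text.splitlines() if line.strip()}


def compare_states(master_dump, backup_dump):
    """Summarise which state lines exist on only one node and how many are shared."""
    master = load_states(master_dump)
    backup = load_states(backup_dump)
    return {
        "only_on_master": sorted(master - backup),
        "only_on_backup": sorted(backup - master),
        "shared": len(master & backup),
    }


# Hypothetical sample dumps (illustrative addresses only).
MASTER_SAMPLE = """\
all tcp 192.168.1.10:51000 -> 203.0.113.5:80       ESTABLISHED:ESTABLISHED
all tcp 192.168.1.10:51001 -> 203.0.113.6:443      ESTABLISHED:ESTABLISHED
"""
BACKUP_SAMPLE = """\
all tcp 192.168.1.10:51000 -> 203.0.113.5:80       ESTABLISHED:ESTABLISHED
"""

result = compare_states(MASTER_SAMPLE, BACKUP_SAMPLE)
print(result["shared"], "shared,", len(result["only_on_master"]), "only on master")
```

If state synchronization is working, the "only on master" list should stay short and mostly contain very recent connections. Exact line-by-line equality is a rough measure, since the state of a connection can differ briefly between nodes.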

jimp

Anything in the system log around the time of the changeover?

bmaster

Nothing special. On the master:


Jul 6 16:48:52 	kernel: carp0: link state changed to DOWN
Jul 6 16:48:52 	kernel: carp0: MASTER -> BACKUP (more frequent advertisement received)
Jul 6 16:48:51 	kernel: carp1: link state changed to DOWN
Jul 6 16:48:51 	kernel: re0: link state changed to DOWN

And on the backup:


Jul 6 16:48:53 	kernel: carp1: link state changed to UP
Jul 6 16:48:52 	kernel: carp0: link state changed to UP
Jul 6 16:48:52 	kernel: carp0: BACKUP -> MASTER (preempting a slower master)

EDIT: Fixed it. On the slave machine, you have to enable "Synchronize Enabled" under CARP settings. This is quite unclear, because in the book it says "you should not configure synchronization from the backup to the master"…

bmaster

One more thing I noticed. This time I tested the failover by unplugging the power cord from the master (this is never good, but it can happen in the real world…). Failover to the slave works fine. Then I plug in the power on the master again, so it starts booting. After a few moments, one of the two carp interfaces (the LAN side) switches back to the master box, but the other carp interface stays on the slave. Only after 2 minutes they both are on the master again. The problem here is that any tcp connections get messed up of course.

Below is some logging from the moment that the master is starting up again. There you see that on the master both CARP interfaces are up at 08:47:04, while on the slave carp0 only switches to backup about 2 minutes later than carp1...

Master node:


Jul 7 08:47:04 	kernel: carp1: link state changed to UP
Jul 7 08:47:04 	kernel: carp1: INIT -> MASTER (preempting)
Jul 7 08:47:04 	kernel: carp1: link state changed to DOWN
Jul 7 08:47:04 	kernel: carp0: link state changed to UP
Jul 7 08:47:04 	kernel: carp0: INIT -> MASTER (preempting)
Jul 7 08:47:04 	kernel: carp0: link state changed to DOWN
Jul 7 08:47:02 	pftpx[556]: listening on 127.0.0.1 port 8021
Jul 7 08:47:02 	pftpx[556]: listening on 127.0.0.1 port 8021
Jul 7 08:47:10 	kernel: carp1: link state changed to DOWN
Jul 7 08:47:10 	kernel: carp1: MASTER -> BACKUP (more frequent advertisement received)
Jul 7 08:47:10 	kernel: carp1: link state changed to UP
Jul 7 08:47:06 	kernel: carp1: link state changed to DOWN
Jul 7 08:47:06 	kernel: carp1: 2 link states coalesced
Jul 7 08:47:06 	kernel: re0: link state changed to UP
Jul 7 08:47:06 	kernel: carp1: INIT -> BACKUP
Jul 7 08:47:06 	kernel: bge0: link state changed to UP
Jul 7 08:47:05 	kernel: carp0: link state changed to DOWN
Jul 7 08:47:05 	kernel: carp0: 2 link states coalesced
Jul 7 08:47:05 	kernel: em1: link state changed to UP
Jul 7 08:47:05 	kernel: carp0: INIT -> BACKUP

Slave node:


Jul 7 08:49:14 	kernel: carp0: link state changed to DOWN
Jul 7 08:49:14 	kernel: carp0: MASTER -> BACKUP (more frequent advertisement received)
Jul 7 08:47:04 	kernel: carp1: link state changed to DOWN
Jul 7 08:47:04 	kernel: carp1: MASTER -> BACKUP (more frequent advertisement received)
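
To put a number on that delay, a small sketch that pulls the CARP state transitions out of the slave log above and measures the gap between them. The regular expression is written against the log lines as pasted in this thread, and the year is an assumption (the log does not include one); it only matters for parsing, not for the difference.

```python
import re
from datetime import datetime

# Matches lines such as: "Jul 7 08:47:04 <tab>kernel: carp1: MASTER -> BACKUP (...)"
CARP_RE = re.compile(r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+kernel: (carp\d+): (\w+) -> (\w+)")


def carp_transitions(log_text, year=2000):
    """Return (time, interface, old_state, new_state) tuples, oldest first.

    The year is an assumed placeholder because syslog lines carry none.
    """
    events = []
    for line in log_text.splitlines():
        match = CARP_RE.match(line.strip())
        if match:
            stamp = " ".join(match.group(1).split())
            when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
            events.append((when, match.group(2), match.group(3), match.group(4)))
    return sorted(events)


# The slave log exactly as posted above.
SLAVE_LOG = """\
Jul 7 08:49:14 \tkernel: carp0: link state changed to DOWN
Jul 7 08:49:14 \tkernel: carp0: MASTER -> BACKUP (more frequent advertisement received)
Jul 7 08:47:04 \tkernel: carp1: link state changed to DOWN
Jul 7 08:47:04 \tkernel: carp1: MASTER -> BACKUP (more frequent advertisement received)
"""

events = carp_transitions(SLAVE_LOG)
gap_seconds = (events[-1][0] - events[0][0]).total_seconds()
print(f"{events[0][1]} -> {events[-1][1]} gap: {gap_seconds:.0f} s")
```

For the slave log above this gives a gap of 130 seconds between carp1 and carp0 giving up the master role.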

jimp

Are those log entries from the master in order? If so, the time is a little out of whack on that server.
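
A quick way to check this is to look for places where the timestamps go the wrong way. The sketch below assumes the log is listed newest first, as in the excerpts in this thread, and flags every spot where a line is newer than the line above it. The sample uses one line per timestamp group from the master log posted above; the year is an assumed placeholder.

```python
import re
from datetime import datetime

STAMP_RE = re.compile(r"^(\w{3})\s+(\d+)\s+(\d\d:\d\d:\d\d)")


def parse_stamps(log_text, year=2000):
    """Parse the leading syslog timestamp of each line (year is assumed)."""
    stamps = []
    for line in log_text.splitlines():
        match = STAMP_RE.match(line.strip())
        if match:
            month, day, clock = match.groups()
            stamps.append(
                datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")
            )
    return stamps


def order_violations(log_text):
    """In a newest-first listing, return (upper, lower) pairs where lower is newer."""
    stamps = parse_stamps(log_text)
    return [(a, b) for a, b in zip(stamps, stamps[1:]) if b > a]


# One line per timestamp group from the master log, in the order it was posted.
MASTER_LOG = """\
Jul 7 08:47:04 \tkernel: carp0: INIT -> MASTER (preempting)
Jul 7 08:47:02 \tpftpx[556]: listening on 127.0.0.1 port 8021
Jul 7 08:47:10 \tkernel: carp1: MASTER -> BACKUP (more frequent advertisement received)
Jul 7 08:47:06 \tkernel: carp1: INIT -> BACKUP
Jul 7 08:47:05 \tkernel: carp0: INIT -> BACKUP
"""

for upper, lower in order_violations(MASTER_LOG):
    jump = (lower - upper).total_seconds()
    print(f"clock stepped back {jump:.0f} s between {lower.time()} and {upper.time()}")
```

On the master log this finds a single spot, between 08:47:10 and 08:47:02, where the clock appears to have stepped back by 8 seconds.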

bmaster

I didn't even notice those timestamps :o Could it be that its time was a couple of seconds wrong after the shutdown, and that it synced it again during boot? Just a wild guess… But I think that would not explain why it takes 2 minutes for the slave server to change carp0 from master to backup, right?

jimp

It's a known issue that if the clocks are off, CARP will not be right, but usually you get some messages like "Incorrect Hash".

Are these physical systems or VMs? If they're physical, what kind of hardware is involved?

If the time problems happen on every boot, there may be a BIOS or RTC issue, or it could be an ACPI issue. There are a couple of different timecounters that can be set on the system; changing that setting might improve the situation as well.

bmaster

You say "if the clocks are off"… do you mean a couple of milliseconds or minutes or ... ?

The two boxes are physical machines: two identical HP DC7600 computers (dual-core 3.4 GHz, 1 GB RAM, onboard Broadcom NetXtreme network interface) with some extra network cards installed.

I have to do some extra testing to tell if it's on every boot. Changing which setting might improve the situation?

EDIT: We did another test. We made box 1 the backup node, and box 2 the master node. Then we rebooted the master (box 2). After reboot the master shows MASTER on all interfaces, but the slave shows MASTER for carp0 and SLAVE for carp1. After about 2 minutes, both interfaces show SLAVE.

Note: In the log file for box2, there's no "time jump" like we saw in the log file for box1 with our previous tests. We entered the correct time in the bios for box1, but for every reboot we see a small time jump (about 8 seconds) in the log file. I don't think this is a problem though because in the last test, box1 keeps running so I assume its time is correct.

EDIT2: We had another identical PC to test with, so I replaced box 1 with this 3rd box, moved the network cards over, restored the backup of config.xml and tried rebooting box2 again. Same thing happens: when box2 wakes up, both its interfaces are master, but interface carp0 of box3 stays master as well (for about 2 minutes). All ideas welcome :-)

bmaster

Any more ideas Jim?

jimp

Unless there's something specific to the switch you're using, I'm not sure.

You could try setting the advskew of the CARP IPs on the backup even higher (200 or so) but I'm not sure that would really make that large of a difference.
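
For reference, a rough sketch of what advskew does to the timing. As I understand the FreeBSD CARP documentation, a node sends advertisements every advbase + advskew/256 seconds, and the node that advertises most frequently becomes master. The function name is made up, the value ranges are the ones I recall from the FreeBSD docs, and the example values below are assumptions, not the settings of these boxes.

```python
def advert_interval(advbase=1, advskew=0):
    """Seconds between CARP advertisements: advbase + advskew/256.

    Ranges follow the FreeBSD documentation as I recall it
    (advbase 1..255 seconds, advskew 0..254); treat them as an assumption.
    """
    if not 1 <= advbase <= 255:
        raise ValueError("advbase must be between 1 and 255")
    if not 0 <= advskew <= 254:
        raise ValueError("advskew must be between 0 and 254")
    return advbase + advskew / 256


# Assumed example values: master with no skew, backup with a skew of 200.
master = advert_interval(advbase=1, advskew=0)
backup = advert_interval(advbase=1, advskew=200)
print(f"master advertises every {master} s, backup every {backup} s")
```

So a higher advskew only makes the backup advertise a fraction of a second slower. That is enough to lose the election, but it would not by itself account for a delay measured in minutes.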

bmaster

On the LAN side both machines are connected to a stack of two Nortel Baystack 5510 switches, each machine connected to a different unit (machine 1 on unit 1, machine 2 on unit 2). On the WAN side, they are connected to the built-in switch of the Speedtouch modem that we have to use for our ISP.

I'll try the tip tomorrow and post the results…

bmaster

Setting the advskew on the slave box to 200 didn't change anything. Is there a known issue with CARP on Speedtouch routers that you know of?

Fixed it! I found a simple switch that I installed between the pfSense boxes and the Speedtouch modem/router. Failover works perfectly now. So it seems that the switch built into the Speedtouch modems isn't really suitable for CARP! Thanks again for all the help!

jimp

A switch can definitely cause that kind of issue, but it's usually pretty uncommon for a physical switch to do so.

Unfortunately now you've got another single point of failure. :-)

bmaster

@jimp wrote:

"Unfortunately now you've got another single point of failure. :-)"

Yeah, but there will always be single points of failure I guess. And I prefer to replace a simple and cheap switch that requires no settings, instead of a pc with 4 extra network interfaces and pfsense that has to be configured :-) Besides, that switch is for internet access. We don't really need internet for our business, the other subnets that we'll connect in the future are more important.

jimp

Sounds good, hopefully that's the end of the issue :-)